Re: Allow to_date() and to_timestamp() to accept localized names
On Sun, Mar 8, 2020 at 6:03 PM Tom Lane wrote: > James Coleman writes: > > So just to confirm I understand, that implies that the issue is solely > that > > only the utf8 tr_TR set is installed by default on this machine, and the > > iso-8859-9 set is a hard requirement (that is, the test is explicitly > > testing a codepath that generates utf8 results from a non-utf8 source)? > > It's not "explicitly" testing that; the fact that "tr_TR" is treated > as "tr_TR.iso88599" is surely an implementation artifact. (On macOS, > some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".) > But yeah, I think it's intentional that we want the codeset translation > path to be exercised here on at least some platforms. > I idly wonder if there could/should be some tests for the implicit case, some explicitly testing the codeset translation (if possible), and some testing the explicit utf8 case...but I don't know enough about this area to push for anything. > > If in fact Ubuntu doesn't install this locale by default, then is this a > > caveat we should add to developer docs somewhere? It seems odd to me that > > I'd be the only one encountering it, but OTOH I would have thought this a > > fairly vanilla install too... > > Not sure. The lack of prior complaints points to this not being a > common situation. It does seem weird that they'd set things up so > that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an > outright packaging mistake, it seems like a POLA violation from here. > > regards, tom lane > All right. I downloaded an Ubuntu 18.04.3 server VM from OSBoxes, and it had very few locales installed by default...so that wasn't all that helpful. I think at this point I'll just leave this as a mystery, much as I hate that. James
Re: Allow to_date() and to_timestamp() to accept localized names
James Coleman writes: > So just to confirm I understand, that implies that the issue is solely that > only the utf8 tr_TR set is installed by default on this machine, and the > iso-8859-9 set is a hard requirement (that is, the test is explicitly > testing a codepath that generates utf8 results from a non-utf8 source)? It's not "explicitly" testing that; the fact that "tr_TR" is treated as "tr_TR.iso88599" is surely an implementation artifact. (On macOS, some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".) But yeah, I think it's intentional that we want the codeset translation path to be exercised here on at least some platforms. > If in fact Ubuntu doesn't install this locale by default, then is this a > caveat we should add to developer docs somewhere? It seems odd to me that > I'd be the only one encountering it, but OTOH I would have thought this a > fairly vanilla install too... Not sure. The lack of prior complaints points to this not being a common situation. It does seem weird that they'd set things up so that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an outright packaging mistake, it seems like a POLA violation from here. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On Sun, Mar 8, 2020 at 2:19 PM Tom Lane wrote: > I wrote: > > James Coleman writes: > >> I'm still interested in understanding why we're using the ISO locale > >> instead of the utf8 one in a utf8-labeled test though. > > > We are not. My understanding of the rules about this is that the > > active LC_CTYPE setting determines the encoding that libc uses, > > period. The encoding suffix on the locale name only makes a > > difference when LC_CTYPE is being specified (or derived from LANG or > > LC_ALL), not any other LC_XXX setting --- although for consistency > > they'll let you include it in any LC_XXX value. > > Oh wait --- I'm wrong about that. Looking at the code in pg_locale.c, > what actually happens is that we get data in the codeset implied by > the LC_TIME setting and then translate it to the database encoding > (cf commit 7ad1cd31b). So if bare "tr_TR" is taken as implying > iso-8859-9, which seems likely (it appears to work that way here, > anyway) then this test is exercising the codeset translation path. > We could change the test to say 'tr_TR.utf8' but that would give us > less test coverage. > So just to confirm I understand, that implies that the issue is solely that only the utf8 tr_TR set is installed by default on this machine, and the iso-8859-9 set is a hard requirement (that is, the test is explicitly testing a codepath that generates utf8 results from a non-utf8 source)? If so, I'm going to try a bare Ubuntu install on a VM and see what locales are installed by default for Turkish. If in fact Ubuntu doesn't install this locale by default, then is this a caveat we should add to developer docs somewhere? It seems odd to me that I'd be the only one encountering it, but OTOH I would have thought this a fairly vanilla install too... James
Re: Allow to_date() and to_timestamp() to accept localized names
I wrote: > James Coleman writes: >> I'm still interested in understanding why we're using the ISO locale >> instead of the utf8 one in a utf8-labeled test though. > We are not. My understanding of the rules about this is that the > active LC_CTYPE setting determines the encoding that libc uses, > period. The encoding suffix on the locale name only makes a > difference when LC_CTYPE is being specified (or derived from LANG or > LC_ALL), not any other LC_XXX setting --- although for consistency > they'll let you include it in any LC_XXX value. Oh wait --- I'm wrong about that. Looking at the code in pg_locale.c, what actually happens is that we get data in the codeset implied by the LC_TIME setting and then translate it to the database encoding (cf commit 7ad1cd31b). So if bare "tr_TR" is taken as implying iso-8859-9, which seems likely (it appears to work that way here, anyway) then this test is exercising the codeset translation path. We could change the test to say 'tr_TR.utf8' but that would give us less test coverage. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
James Coleman writes: > Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine, > I tried selecting the tr_TR.ISO-8859-9 option there and removed the > /var/lib/locales/supported.d/local file. Now I get: > $ locale -a | grep tr_TR > tr_TR > tr_TR.iso88599 > tr_TR.utf8 > And now `make check` passes. Hm. > I'm still interested in understanding why we're using the ISO locale > instead of the utf8 one in a utf8-labeled test though. We are not. My understanding of the rules about this is that the active LC_CTYPE setting determines the encoding that libc uses, period. The encoding suffix on the locale name only makes a difference when LC_CTYPE is being specified (or derived from LANG or LC_ALL), not any other LC_XXX setting --- although for consistency they'll let you include it in any LC_XXX value. We could of course choose to spell LC_TIME as 'tr_TR.utf8' in this test, but that would open up the question of whether that causes problems on platforms where the canonical spelling of the encoding suffix is different (eg "UTF-8"). I'm disinclined to take that risk without positive proof that it helps on some platform. My guess about what was actually happening on your machine is that the underlying locale data only exists in iso-8859-9 form, and that glibc normally translates that to utf8 on-the-fly when it's demanded ... but if the data isn't installed at all, nothing happens. On my Red Hat platforms, this installation choice seems pretty all-or-nothing, but it sounds like Debian lets you chop it up more finely (and maybe slice off your finger while at it :-(). regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On Sun, Mar 8, 2020 at 10:42 AM James Coleman wrote: > On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha < > juanjo.santama...@gmail.com> wrote: > >> >> >> On Sun, Mar 8, 2020 at 3:48 AM Tom Lane wrote: >> >>> James Coleman writes: >>> > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane wrote: >>> >> Looks like you may not have Turkish locale installed? Try >>> >> locale -a | grep tr_TR >>> >>> > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I >>> assume the >>> > utf8 version is acceptable? Or is there a non-utf8 variant? >>> >>> Hmm ... I'm far from an expert on the packaging of locale data, but >>> the simplest explanation I can think of is that the tr_TR locale exists >>> to some extent on your machine but the LC_TIME component of that is >>> missing. >>> >> >> AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not >> the same as 'tr_TR.utf8'. >> > > The test name implies it's about utf8, though, which makes me wonder if > the test should be testing utf8 instead? > > That being said, a bit more googling based on your node about the proper > ISO encoding turned up this page: https://unix.stackexchange.com/a/446762 > > And I confirmed that the locale you mentioned is available: > $ grep "tr_TR" /usr/share/i18n/SUPPORTED > tr_TR.UTF-8 UTF-8 > tr_TR ISO-8859-9 > > So I tried: > echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root > session > sudo dpkg-reconfigure locales > > That didn't seem to fix it, though `locale -a` still only lists > tr_TR.utf8, so I'm still at a loss, and also unclear why a test names utf8 > is actually relying on an ISO encoding. > Another update: Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine, I tried selecting the tr_TR.ISO-8859-9 option there and removed the /var/lib/locales/supported.d/local file. Now I get: $ locale -a | grep tr_TR tr_TR tr_TR.iso88599 tr_TR.utf8 And now `make check` passes. I'm still interested in understanding why we're using the ISO locale instead of the utf8 one in a utf8-labeled test though. Thanks, James
Re: Allow to_date() and to_timestamp() to accept localized names
On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha < juanjo.santama...@gmail.com> wrote: > > > On Sun, Mar 8, 2020 at 3:48 AM Tom Lane wrote: > >> James Coleman writes: >> > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane wrote: >> >> Looks like you may not have Turkish locale installed? Try >> >> locale -a | grep tr_TR >> >> > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume >> the >> > utf8 version is acceptable? Or is there a non-utf8 variant? >> >> Hmm ... I'm far from an expert on the packaging of locale data, but >> the simplest explanation I can think of is that the tr_TR locale exists >> to some extent on your machine but the LC_TIME component of that is >> missing. >> > > AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not > the same as 'tr_TR.utf8'. > The test name implies it's about utf8, though, which makes me wonder if the test should be testing utf8 instead? That being said, a bit more googling based on your node about the proper ISO encoding turned up this page: https://unix.stackexchange.com/a/446762 And I confirmed that the locale you mentioned is available: $ grep "tr_TR" /usr/share/i18n/SUPPORTED tr_TR.UTF-8 UTF-8 tr_TR ISO-8859-9 So I tried: echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root session sudo dpkg-reconfigure locales That didn't seem to fix it, though `locale -a` still only lists tr_TR.utf8, so I'm still at a loss, and also unclear why a test names utf8 is actually relying on an ISO encoding. James
Re: Allow to_date() and to_timestamp() to accept localized names
On Sat, Mar 7, 2020 at 9:48 PM Tom Lane wrote: > James Coleman writes: > > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane wrote: > >> Looks like you may not have Turkish locale installed? Try > >> locale -a | grep tr_TR > > > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume > the > > utf8 version is acceptable? Or is there a non-utf8 variant? > > Hmm ... I'm far from an expert on the packaging of locale data, but > the simplest explanation I can think of is that the tr_TR locale exists > to some extent on your machine but the LC_TIME component of that is > missing. > > Do you get different results from "date" depending on the locale? > I get > > $ LANG=C date > Sat Mar 7 21:44:24 EST 2020 > $ LANG=tr_TR.utf8 date > Cts Mar 7 21:44:26 EST 2020 > $ LANG=C date Sun Mar 8 09:27:50 EDT 2020 $ LANG=tr_TR.utf8 date Paz Mar 8 09:27:54 EDT 2020 $ LANG=tr_TR date Sun Mar 8 09:27:57 EDT 2020 I'm curious if you get something different for that last one (no utf8 qualifier). on my Fedora 30 box. > > Another possibility perhaps is that you have partial locale settings > in your environment that are bollixing the test. Try > > $ env | grep ^LANG > This gives me: LANG=en_US.UTF-8 LANGUAGE=en_US > $ env | grep ^LC_ > Nothing for this. > If there's more than one relevant environment setting, and they > don't all agree, I'm not sure what would happen with our > regression tests. > There's LANG and LANGUAGE but they look like they effectively agree to me? > BTW, what platform are you using anyway? > ElementaryOS 5.1 Hera, which is built on top of Ubuntu 18.04.3 LTS (and Linux 4.15.0-72-generic). This suggests I have the standard Ubuntu Turkish language packages installed: $ dpkg -l | grep language-pack-tr ii language-pack-tr 1:18.04+20180712 all translation updates for language Turkish ii language-pack-tr-base 1:18.04+20180712 all translations for language Turkish James
Re: Allow to_date() and to_timestamp() to accept localized names
On Sun, Mar 8, 2020 at 3:48 AM Tom Lane wrote: > James Coleman writes: > > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane wrote: > >> Looks like you may not have Turkish locale installed? Try > >> locale -a | grep tr_TR > > > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume > the > > utf8 version is acceptable? Or is there a non-utf8 variant? > > Hmm ... I'm far from an expert on the packaging of locale data, but > the simplest explanation I can think of is that the tr_TR locale exists > to some extent on your machine but the LC_TIME component of that is > missing. > AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not the same as 'tr_TR.utf8'. > BTW, what platform are you using anyway? > I have just checked in a Debian Stretch Regards, Juan José Santamaría Flecha
Re: Allow to_date() and to_timestamp() to accept localized names
James Coleman writes: > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane wrote: >> Looks like you may not have Turkish locale installed? Try >> locale -a | grep tr_TR > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the > utf8 version is acceptable? Or is there a non-utf8 variant? Hmm ... I'm far from an expert on the packaging of locale data, but the simplest explanation I can think of is that the tr_TR locale exists to some extent on your machine but the LC_TIME component of that is missing. Do you get different results from "date" depending on the locale? I get $ LANG=C date Sat Mar 7 21:44:24 EST 2020 $ LANG=tr_TR.utf8 date Cts Mar 7 21:44:26 EST 2020 on my Fedora 30 box. Another possibility perhaps is that you have partial locale settings in your environment that are bollixing the test. Try $ env | grep ^LANG $ env | grep ^LC_ If there's more than one relevant environment setting, and they don't all agree, I'm not sure what would happen with our regression tests. BTW, what platform are you using anyway? regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On Sat, Mar 7, 2020 at 9:31 PM Tom Lane wrote: > James Coleman writes: > > On master with a clean build (and configure re-run) and a fresh init-db, > > I'm seeing the collate.linux.utf8 test fail with the attached diff. > > -- to_char > SET lc_time TO 'tr_TR'; > +ERROR: invalid value for parameter "lc_time": "tr_TR" > SELECT to_char(date '2010-02-01', 'DD TMMON '); > > Looks like you may not have Turkish locale installed? Try > > locale -a | grep tr_TR > > If you don't see "tr_TR.utf8" or some variant spelling of that, > the collate.linux.utf8 test is not gonna pass. The required > package is probably some sub-package of glibc. > > A workaround if you don't want to install more stuff is to run the > regression tests in C locale, so that that test script gets skipped. > > regards, tom lane > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the utf8 version is acceptable? Or is there a non-utf8 variant? Thanks, James
Re: Allow to_date() and to_timestamp() to accept localized names
James Coleman writes: > On master with a clean build (and configure re-run) and a fresh init-db, > I'm seeing the collate.linux.utf8 test fail with the attached diff. -- to_char SET lc_time TO 'tr_TR'; +ERROR: invalid value for parameter "lc_time": "tr_TR" SELECT to_char(date '2010-02-01', 'DD TMMON '); Looks like you may not have Turkish locale installed? Try locale -a | grep tr_TR If you don't see "tr_TR.utf8" or some variant spelling of that, the collate.linux.utf8 test is not gonna pass. The required package is probably some sub-package of glibc. A workaround if you don't want to install more stuff is to run the regression tests in C locale, so that that test script gets skipped. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On Tue, Mar 3, 2020 at 11:09 AM Tom Lane wrote: > =?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= < > juanjo.santama...@gmail.com> writes: > > On Sun, Mar 1, 2020 at 11:25 PM Tom Lane wrote: > >> Going once, going twice ... > > > You have moved this to better place, so none from me, and thank you. > > Pushed then. > On master with a clean build (and configure re-run) and a fresh init-db, I'm seeing the collate.linux.utf8 test fail with the attached diff. Is there something I need to reconfigure, install on my machine (elementary os, so based on Ubuntu) for this to pass? Thanks, James regression.diffs Description: Binary data
Re: Allow to_date() and to_timestamp() to accept localized names
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= writes: > On Sun, Mar 1, 2020 at 11:25 PM Tom Lane wrote: >> Going once, going twice ... > You have moved this to better place, so none from me, and thank you. Pushed then. While looking over the patch one more time, I noticed some pretty poor English in docs and error messages for the related jsonb code, so I took it on myself to fix that while at it. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On Sun, Mar 1, 2020 at 11:25 PM Tom Lane wrote: > > > Barring objections, this seems > > committable to me. > > Going once, going twice ... > You have moved this to better place, so none from me, and thank you. Regards, Juan José Santamaría Flecha
Re: Allow to_date() and to_timestamp() to accept localized names
I wrote: > * I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID > as the collation to do case-folding with. For to_date/to_timestamp, > we have collatable text input so we can rely on the function's input > collation instead, at the cost of having to pass down the collation > OID through a few layers of subroutines :-(. For parse_datetime, > I punted for now and let it use DEFAULT_COLLATION_OID, because that's > currently only called by JSONB code that probably hasn't got a usable > input collation anyway (since jsonb isn't considered collatable). On closer look, it's probably a wise idea to change the signature of parse_datetime() to include a collation argument, because that function is new in v13 so there's no API-break argument against it. It will never be cheaper to change it than today. So v11 below does that, pushing the use of DEFAULT_COLLATION_OID into the json-specific code. Perhaps somebody else would like to look at whether there's something brighter for that code to do, but I still suspect there isn't, so I didn't chase it further. > Barring objections, this seems > committable to me. Going once, going twice ... regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 28035f1..8b73e05 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM prefix -translation mode (print localized day and month names based on +translation mode (use localized day and month names based on ) TMMonth @@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); - TM does not include trailing blanks. - to_timestamp and to_date ignore - the TM modifier. + TM suppresses trailing blanks whether or + not FM is specified. + + + + + + to_timestamp and to_date + ignore letter case in the input; so for + example MON, Mon, + and mon all accept the same strings. When using + the TM modifier, case-folding is done according to + the rules of the function's input collation (see + ). diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index d029468..95f7d05 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw, static void DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid collid); static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, - bool std, bool *have_error); + Oid collid, bool std, bool *have_error); #ifdef DEBUG_TO_FROM_CHAR static void dump_index(const KeyWord *k, const int *index); @@ -1057,11 +1057,14 @@ static int from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error); -static int seq_search(const char *name, const char *const *array, int *len); +static int seq_search_ascii(const char *name, const char *const *array, int *len); +static int seq_search_localized(const char *name, char **array, int *len, + Oid collid); static int from_char_seq_search(int *dest, const char **src, const char *const *array, + char **localized_array, Oid collid, FormatNode *node, bool *have_error); -static void do_to_timestamp(text *date_txt, text *fmt, bool std, +static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std, struct pg_tm *tm, fsec_t *fsec, int *fprec, uint32 *flags, bool *have_error); static char *fill_str(char *str, int c, int max); @@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er * suitable for comparisons to ASCII strings. */ static int -seq_search(const char *name, const char *const *array, int *len) +seq_search_ascii(const char *name, const char *const *array, int *len) { unsigned char firstc; const char *const *a; @@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len) } /* - * Perform a sequential search in 'array' for an entry matching the first - * character(s) of the 'src' string case-insensitively. + * Sequentially search an array of possibly non-English words for + * a case-insensitive match to the initial character(s) of "name". + * + * This has the same API as seq_search_ascii(), but we use a more general + * case-folding transformation to achieve case-insensitivity. Case folding + * is done per the rules of the collation identified by "collid". + * + * The array is treated as const, but we don't declare it that way because + * the arrays exported by p
Re: Allow to_date() and to_timestamp() to accept localized names
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= writes: > On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut < > peter.eisentr...@2ndquadrant.com> wrote: >> So I'm withdrawing my concerns with respect to this patch. As long as >> it can do a roundtrip conversion with to_char(), it's fine. > We can avoid issues with non injective case conversion languages with a > double conversion, so both strings in the comparison end up in the same > state. > I propose an upper/lower conversion as in the attached patch. This seems pretty reasonable to me, with a couple of caveats: * I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID as the collation to do case-folding with. For to_date/to_timestamp, we have collatable text input so we can rely on the function's input collation instead, at the cost of having to pass down the collation OID through a few layers of subroutines :-(. For parse_datetime, I punted for now and let it use DEFAULT_COLLATION_OID, because that's currently only called by JSONB code that probably hasn't got a usable input collation anyway (since jsonb isn't considered collatable). * I think it's likely worthwhile to do a quick check for an exact match before we go through all these case-folding pushups. If the expected use-case is reversing to_char() output, that will win all the time. We're probably ahead of the game even if it only matches a few percent of the time. Attached v10 incorporates those improvements, plus a bit of polishing of docs and comments. Barring objections, this seems committable to me. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 28035f1..8b73e05 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM prefix -translation mode (print localized day and month names based on +translation mode (use localized day and month names based on ) TMMonth @@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); - TM does not include trailing blanks. - to_timestamp and to_date ignore - the TM modifier. + TM suppresses trailing blanks whether or + not FM is specified. + + + + + + to_timestamp and to_date + ignore letter case in the input; so for + example MON, Mon, + and mon all accept the same strings. When using + the TM modifier, case-folding is done according to + the rules of the function's input collation (see + ). diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index d029468..db99a6b 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw, static void DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid collid); static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, - bool std, bool *have_error); + Oid collid, bool std, bool *have_error); #ifdef DEBUG_TO_FROM_CHAR static void dump_index(const KeyWord *k, const int *index); @@ -1057,11 +1057,14 @@ static int from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error); -static int seq_search(const char *name, const char *const *array, int *len); +static int seq_search_ascii(const char *name, const char *const *array, int *len); +static int seq_search_localized(const char *name, char **array, int *len, + Oid collid); static int from_char_seq_search(int *dest, const char **src, const char *const *array, + char **localized_array, Oid collid, FormatNode *node, bool *have_error); -static void do_to_timestamp(text *date_txt, text *fmt, bool std, +static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std, struct pg_tm *tm, fsec_t *fsec, int *fprec, uint32 *flags, bool *have_error); static char *fill_str(char *str, int c, int max); @@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er * suitable for comparisons to ASCII strings. */ static int -seq_search(const char *name, const char *const *array, int *len) +seq_search_ascii(const char *name, const char *const *array, int *len) { unsigned char firstc; const char *const *a; @@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len) } /* - * Perform a sequential search in 'array' for an entry matching the first - * character(s) of the 'src' string case-insensitively. + * Sequentially search an array of possibly non-English words for
Re: Allow to_date() and to_timestamp() to accept localized names
On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2020-01-28 16:47, Juan José Santamaría Flecha wrote: > > This patch targets to do something symmetrical to to_char(), which will > > just return a single value. > > I didn't fully realize while reading this thread that to_char() already > supports localized output and this patch indeed just wants to do the > opposite. > > So I'm withdrawing my concerns with respect to this patch. As long as > it can do a roundtrip conversion with to_char(), it's fine. > > We can avoid issues with non injective case conversion languages with a double conversion, so both strings in the comparison end up in the same state. I propose an upper/lower conversion as in the attached patch. Regards, Juan José Santamaría Flecha diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index ceda48e..b1951e5 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM prefix -translation mode (print localized day and month names based on +translation mode (use localized day and month names based on ) TMMonth @@ -6000,8 +6000,13 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM does not include trailing blanks. + + + + + to_timestamp and to_date ignore - the TM modifier. + the case when receiving names as an input. diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index f58331d..e5b4eb5 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -1059,9 +1059,11 @@ static int from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error); -static int seq_search(const char *name, const char *const *array, int *len); +static int seq_search_ascii(const char *name, const char *const *array, int *len); +static int seq_search_localized(const char *name, char **array, int *len); static int from_char_seq_search(int *dest, const char **src, const char *const *array, + char **localized_array, FormatNode *node, bool *have_error); static void do_to_timestamp(text *date_txt, text *fmt, bool std, struct pg_tm *tm, fsec_t *fsec, int *fprec, @@ -2459,7 +2461,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er * suitable for comparisons to ASCII strings. */ static int -seq_search(const char *name, const char *const *array, int *len) +seq_search_ascii(const char *name, const char *const *array, int *len) { unsigned char firstc; const char *const *a; @@ -2505,8 +2507,74 @@ seq_search(const char *name, const char *const *array, int *len) } /* - * Perform a sequential search in 'array' for an entry matching the first - * character(s) of the 'src' string case-insensitively. + * Sequentially search an array of possibly non-English words for + * a case-insensitive match to the initial character(s) of "name". + * + * This has the same API as seq_search_ascii(), but we use a more general + * downcasing transformation to achieve case-insensitivity. + * + * The array is treated as const, but we don't declare it that way because + * the arrays exported by pg_locale.c aren't const. + */ +static int +seq_search_localized(const char *name, char **array, int *len) +{ + char **a; + char *lower_name; + char *upper_name; + + *len = 0; + + /* empty string can't match anything */ + if (!*name) + return -1; + + /* + * We do an upper/lower conversion to avoid problems with languages + * in which case conversions are not injective. + */ + upper_name = str_toupper(unconstify(char *, name), strlen(name), + DEFAULT_COLLATION_OID); + lower_name = str_tolower(upper_name, strlen(upper_name), + DEFAULT_COLLATION_OID); + pfree(upper_name); + + for (a = array; *a != NULL; a++) + { + char *lower_element; + char *upper_element; + int element_len; + + /* Upper/lower-case array element, assuming it is normalized */ + upper_element = str_toupper(*a, strlen(*a), DEFAULT_COLLATION_OID); + lower_element = str_tolower(upper_element, strlen(upper_element), + DEFAULT_COLLATION_OID); + pfree(upper_element); + element_len = strlen(lower_element); + + /* Match? */ + if (strncmp(lower_name, lower_element, element_len) == 0) + { + *len = element_len; + pfree(lower_element); + pfree(lower_name); + return a - array; + } + pfree(lower_element); + } + + pfree(lower_name); + return -1; +} + +/* + * Perform a sequential search in 'array' (or 'localized_array', if that's + * not NULL) for an entry matching the first character(s) of the 'src' + * string case-insensitively. + * +
Re: Allow to_date() and to_timestamp() to accept localized names
On Tue, Jan 28, 2020 at 5:21 PM Alvaro Herrera wrote: > On 2020-Jan-28, Peter Eisentraut wrote: > > > On 2020-01-28 04:05, Mark Dilger wrote: > > > German uses both Sonnabend and Samstag for Saturday, so don’t you have > to compare to a list of values anyway? > > > > Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", > then > > it's not really usable. > > The string "Sonnabend" never appears in the glibc sources, so that will > certainly not work. I vote not to care about that, but of course my > language is not one that has alternate weekday/month names. I guess if > we're intent on recognizing alternate names, we'll have to build our own > list of them :-( > > I don't have the ICU sources here to check the status there. > > "Sonnabend" is neither available in ICU. What is available are both genitive and nominative forms for months, as reported up thread by Peter. See formats "M" and "L" in: http://userguide.icu-project.org/formatparse/datetime Regards, Juan José Santamaría Flecha
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-01-28 16:47, Juan José Santamaría Flecha wrote: This patch targets to do something symmetrical to to_char(), which will just return a single value. I didn't fully realize while reading this thread that to_char() already supports localized output and this patch indeed just wants to do the opposite. So I'm withdrawing my concerns with respect to this patch. As long as it can do a roundtrip conversion with to_char(), it's fine. It's pretty clear that this interface cannot be useful for producing or parsing fully general localized dates. But since it exists now (and it's quasi SQL standard now), we might as well fill in this gap. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
Mark Dilger writes: >> On Jan 28, 2020, at 9:30 AM, Tom Lane wrote: >> A brute-force answer, if there are few enough cases, is to recognize >> them regardless of the specific value of LC_TIME. Do we think >> anybody would notice or care if to_date('Sonnabend', 'TMDay') works >> even when in a non-German locale? > I think this only becomes a problem if there are weekday or month name > collisions between languages where they have different meanings. As an > absurd hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in > English, then you’ve got a problem. > This does happen for month abbreviations. “Mar” is short for “March” and > variations in a number of languages, but short for “November” in Finnish. > For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” > collides between sl_SL and no_NO. But none of those cases are alternative names, so we wouldn't have entries for them in this hypothetical list. They'd only be recognized when strftime() returns them. I suspect also that the abbreviated names have few if any alternatives, so we may only need this whole hack for full names. A possible way of tightening things up without explicitly decoding the locale name is to make the entries of the list be string pairs: "if strftime() returned this, then also consider that". That puts a bit of a premium on knowing which names strftime considers primary, but I think we'd have had to know that anyway. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
> On Jan 28, 2020, at 9:30 AM, Tom Lane wrote: > > A brute-force answer, if there are few enough cases, is to recognize > them regardless of the specific value of LC_TIME. Do we think > anybody would notice or care if to_date('Sonnabend', 'TMDay') works > even when in a non-German locale? I think this only becomes a problem if there are weekday or month name collisions between languages where they have different meanings. As an absurd hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in English, then you’ve got a problem. This does happen for month abbreviations. “Mar” is short for “March” and variations in a number of languages, but short for “November” in Finnish. For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” collides between sl_SL and no_NO. I don’t see any collisions with full month or full weekday names, but I’m also only looking at the 53 locales installed in /usr/share/locale/.*/LC_TIME on my mac. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Allow to_date() and to_timestamp() to accept localized names
Mark Dilger writes: > But then the manual page goes on to say: >> %E* %O* >> POSIX locale extensions. The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH >> %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide alternate >> representations. >> >> Additionally %OB implemented to represent alternative months names (used >> standalone, without day mentioned). > This is the part I haven’t played with, but it sounds like it can handle at > least one alternate name. Perhaps you can get the alternates this way? This sounded promising, but the POSIX strftime spec doesn't mention %OB, so I'm afraid we can't count on it to do much. At this point I'm not really convinced that there are no languages with more than two forms, anyway :-(. I also wondered whether we could get any further by using strptime() to convert localized month and day names on-the-fly, rather than the patch's current approach of re-using strftime() results. If strptime() fails to support alternative names, it's their bug not ours. Unfortunately, glibc has got said bug (AFAICS anyway), so in practice this would only offer us plausible deniability and not much of any real functionality. In the end it seems like we could only handle alternative names by keeping our own lists of them. There are probably few enough cases that that wouldn't be a tremendous maintenance problem, but what I'm not quite seeing is how we'd decide which list to use when. Right now, locale identifiers are pretty much opaque to us ... do we really want to get into the business of recognizing that such a name refers to German, or Greek? A brute-force answer, if there are few enough cases, is to recognize them regardless of the specific value of LC_TIME. Do we think anybody would notice or care if to_date('Sonnabend', 'TMDay') works even when in a non-German locale? regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-Jan-28, Peter Eisentraut wrote: > On 2020-01-28 04:05, Mark Dilger wrote: > > German uses both Sonnabend and Samstag for Saturday, so don’t you have to > > compare to a list of values anyway? > > Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", then > it's not really usable. The string "Sonnabend" never appears in the glibc sources, so that will certainly not work. I vote not to care about that, but of course my language is not one that has alternate weekday/month names. I guess if we're intent on recognizing alternate names, we'll have to build our own list of them :-( I don't have the ICU sources here to check the status there. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
> On Jan 28, 2020, at 7:47 AM, Juan José Santamaría Flecha > wrote: > > This looks like a POSIX feature that some systems might not like (Windows > [1]). But if this is something that the patch should aim to, I am fine with a > RWF and give it another try in the future. As long as this implementation doesn’t create a backward-compatibility problem when doing a more complete implementation later, I’m fine with this patch not tackling the whole problem. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Allow to_date() and to_timestamp() to accept localized names
On Tue, Jan 28, 2020 at 4:06 PM Mark Dilger wrote: > > I’m not insisting, just asking about the design. If it only works with > one name supported per weekday per language, and per month per language, > I’m not going to object. I was just curious if you were going to support > multiple names anyway, and if that made the question about the Greek case > folding less pressing. > > This patch targets to do something symmetrical to to_char(), which will just return a single value. The issue with the greek locale is that it cannot hold this equivalent behaviour, as in: postgres=# select to_date(to_char(now(), 'TMMONTH'), 'TMMONTH'); ERROR: invalid value "ΙΑΝΟΥΆΡΙΟΣ" for "MONTH" Because of the particular behaviour sigma (Σσς) casing has, which is also user visible with a lower(). > > > %E* %O* > >POSIX locale extensions. The sequences %Ec %EC %Ex %EX %Ey > %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to > provide alternate representations. > > > >Additionally %OB implemented to represent alternative months > names (used standalone, without day mentioned). > > This is the part I haven’t played with, but it sounds like it can handle > at least one alternate name. Perhaps you can get the alternates this way? > > This looks like a POSIX feature that some systems might not like (Windows [1]). But if this is something that the patch should aim to, I am fine with a RWF and give it another try in the future. [1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strftime-wcsftime-strftime-l-wcsftime-l?view=vs-2019 Regards, Juan José Santamaría Flecha
Re: Allow to_date() and to_timestamp() to accept localized names
> On Jan 28, 2020, at 6:51 AM, Tom Lane wrote: > > Peter Eisentraut writes: >> On 2020-01-28 04:05, Mark Dilger wrote: >>> German uses both Sonnabend and Samstag for Saturday, so don’t you have to >>> compare to a list of values anyway? > >> Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", >> then it's not really usable. > > If we're going to insist on that, then the entire patch is junk. I’m not insisting, just asking about the design. If it only works with one name supported per weekday per language, and per month per language, I’m not going to object. I was just curious if you were going to support multiple names anyway, and if that made the question about the Greek case folding less pressing. > Indeed, I don't even know where we could get the knowledge of which > name(s) to accept, because strftime is surely only going to tell us > one of them. I haven’t played with this yet, but per the manual page for strftime_l: > %Ais replaced by national representation of the full weekday name. > > %ais replaced by national representation of the abbreviated weekday > name. > > %Bis replaced by national representation of the full month name. > > %bis replaced by national representation of the abbreviated month > name. Which I think gives you just the preferred name, by whatever means the library decides which of Sonnabend/Samstag is the preferred name. But then the manual page goes on to say: > %E* %O* >POSIX locale extensions. The sequences %Ec %EC %Ex %EX %Ey %EY > %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide > alternate representations. > >Additionally %OB implemented to represent alternative months names > (used standalone, without day mentioned). This is the part I haven’t played with, but it sounds like it can handle at least one alternate name. Perhaps you can get the alternates this way? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Allow to_date() and to_timestamp() to accept localized names
Peter Eisentraut writes: > On 2020-01-28 04:05, Mark Dilger wrote: >> German uses both Sonnabend and Samstag for Saturday, so don’t you have to >> compare to a list of values anyway? > Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", > then it's not really usable. If we're going to insist on that, then the entire patch is junk. Indeed, I don't even know where we could get the knowledge of which name(s) to accept, because strftime is surely only going to tell us one of them. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-01-28 03:11, Tom Lane wrote: The other question your example raises is whether we should be trying to de-accent before comparison, ie was it right for 'Ιανουάριος' to be treated differently from 'Ιανουαριος'. I don't know enough Greek to say, but it kind of feels like that should be outside to_date's purview. To clarify, nothing in my examples was meant to imply that de-accenting might be necessary. AFAICT, the accented forms are the only linguistically acceptable forms and all the locale libraries accept them correctly in their accented forms. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-01-28 04:05, Mark Dilger wrote: German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway? Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", then it's not really usable. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
> On Jan 27, 2020, at 6:11 PM, Tom Lane wrote: > > and I think it > makes German better (does 'ß' appear in any month/day names there?), > so maybe we should just roll with that. To my knowledge, neither /ß/ nor /ss/ occur in day or month names in German, neither presently nor in recent times, so I wouldn’t expect it to matter for German whether you use uppercase or lowercase. > > The other question your example raises is whether we should be trying > to de-accent before comparison, ie was it right for 'Ιανουάριος' to > be treated differently from 'Ιανουαριος'. I don't know enough Greek > to say, but it kind of feels like that should be outside to_date's > purview. I’m getting a little lost here. German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway? I don’t know Norwegian, but Wikipedia shows both sundag and søndag for Sunday in Norwegian Nynorsk. Faroese seems to have a few similar cases. Whether those languages have alternate day names both in common usage, I can’t say, but both Sonnabend and Samstag are still in use in the German speaking world. If you need to compare against lists, then I would think you could put both ιανουάριοσ and ιανουάριος into a list, even if only one of them is grammatically correct Greek. I’d think that unaccented versions of names might also go into the list, depending on whether speakers of that language consider them equivalent. That sounds like something for the translators to hash out. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Allow to_date() and to_timestamp() to accept localized names
Peter Eisentraut writes: > For the record, the correct form of that would appear to be > select to_date('Ιανουάριος', 'TMMonth'); > with the accent. I had tried different variations of that and they all > failed. OK, so for anyone who is as confused as I was, the main point here seems to be this: the upper case form of Greek sigma is 'Σ', and the lower case form is 'σ' ... except as the final letter of a word, where it is supposed to be written like 'ς'. If I set lc_collate, lc_ctype, and lc_time to 'el_GR.utf8', then (on a somewhat hoary glibc platform) I get u8=# select to_char('2020-01-01'::timestamptz, 'TMMONTH'); to_char -- ΙΑΝΟΥΆΡΙΟΣ (1 row) u8=# select to_char('2020-01-01'::timestamptz, 'TMMonth'); to_char -- Ιανουάριος (1 row) u8=# select to_char('2020-01-01'::timestamptz, 'TMmonth'); to_char -- ιανουάριος (1 row) which is correct AFAICS ... but u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH')); lower -- ιανουάριοσ (1 row) So what we actually have here, ISTM, is a bug in lower() not to_char(). The bug is unsurprising because str_tolower() simply applies towlower_l() to each character independently, so there's no way for it to account for the word-final rule. I'm not aware that glibc provides any API whereby that could be done correctly. On the other hand, we get it right when using an ICU collation for lower(): u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH') collate "el-gr-x-icu"); lower -- ιανουάριος (1 row) because that code path passes the whole string to ICU at once, and of course getting this right is ICU's entire job. I haven't double-checked, but I imagine that the reason that to_char gets the month name case-folding right is that what comes out of strftime(..."%B"...) is "Ιανουάριος" which we are able to upcase correctly, while the downcasing code paths don't affect 'ς'. I thought for a little bit about trying to dodge this issue in the patch by folding to upper case, not lower, before comparing month/day names. I fear that that would just shift the problem cases to some other language(s). However, it makes Greek better, and I think it makes German better (does 'ß' appear in any month/day names there?), so maybe we should just roll with that. In the end, it doesn't seem right to reject this patch just because lower() is broken on some platforms. The other question your example raises is whether we should be trying to de-accent before comparison, ie was it right for 'Ιανουάριος' to be treated differently from 'Ιανουαριος'. I don't know enough Greek to say, but it kind of feels like that should be outside to_date's purview. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-01-24 19:01, Peter Eisentraut wrote: postgres=# select to_char(now(),'TMmonth'); to_char ιανουαρίου (1 row) which is the genitive of ιανουάριος. You use the genitive form for a date (24th of January) but the nominative otherwise. But the reverse mapping can only take one of these forms. So here select to_date('Ιανουαριος', 'TMMonth'); fails, which is bad. For the record, the correct form of that would appear to be select to_date('Ιανουάριος', 'TMMonth'); with the accent. I had tried different variations of that and they all failed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-01-24 18:25, Juan José Santamaría Flecha wrote: To illustrate the issue, it does not work as expected: postgres=# select lower(to_char(now(),'TMMONTH')); lower ιανουάριοσ (1 row) postgres=# select to_char(now(),'TMmonth'); to_char ιανουάριος (1 row) Well, this is interesting, because on macOS and Debian stable I get postgres=# select to_char(now(),'TMmonth'); to_char ιανουαρίου (1 row) which is the genitive of ιανουάριος. You use the genitive form for a date (24th of January) but the nominative otherwise. But the reverse mapping can only take one of these forms. So here select to_date('Ιανουαριος', 'TMMonth'); fails, which is bad. In the glibc locale data sources they have both forms listed, but apparently the API were are using only accepts one of them. (I don't know any Greek, I'm just extrapolating from Wikipedia and locale data.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On Fri, Jan 24, 2020 at 5:46 PM Tom Lane wrote: > Peter Eisentraut writes: > > Looking through the patch quickly, if you want to get Unicode-fancy, > > doing a case-insensitive comparison by running lower-case on both > > strings is also wrong in corner cases. All the Greek month names end in > > sigma, so I suspect that this patch might not work correctly in such > cases. > > Hm. That's basically what citext does, and I don't recall hearing > complaints about that. What other definition of "case insensitive" > would you suggest? > > To illustrate the issue, it does not work as expected: postgres=# select lower(to_char(now(),'TMMONTH')); lower ιανουάριοσ (1 row) postgres=# select to_char(now(),'TMmonth'); to_char ιανουάριος (1 row) Regards, Juan José Santamaría Flecha
Re: Allow to_date() and to_timestamp() to accept localized names
Peter Eisentraut writes: > Looking through the patch quickly, if you want to get Unicode-fancy, > doing a case-insensitive comparison by running lower-case on both > strings is also wrong in corner cases. All the Greek month names end in > sigma, so I suspect that this patch might not work correctly in such cases. Hm. That's basically what citext does, and I don't recall hearing complaints about that. What other definition of "case insensitive" would you suggest? regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-01-24 17:22, Tom Lane wrote: Alvaro Herrera writes: But that's a different POV. The input to this function could come from arbitrary user input from any application whatsoever. So the only reason we can get away with that is because the example regression case Juan José added (which uses non-normals) does not conform to the standard. I'm unsure about "conforming to standard", but I think it's reasonable to put the onus of doing normalization when necessary on the user. Otherwise, we need to move normalization logic into basically all the string processing functions (even texteq), which seems like a pretty huge cost that will benefit only a small minority of people. (If it's not a small minority, then where's the bug reports complaining that we don't do it today?) These reports do exist, and this behavior is known. However, the impact is mostly that results "look wrong" (looks the same but doesn't compare as equal) rather than causing inconsistency and corruption, so it's mostly shrugged off. The nondeterministic collation feature was introduced in part to be able to deal with this; the pending normalization patch is another. However, this behavior is baked deeply into Unicode, so no single feature or facility will simply make it go away. AFAICT, we haven't so far had any code that does a lookup of non-ASCII strings in a table, so that's why we haven't had this discussion yet. Now that I think about it, you could also make an argument that this should be handled through collation, so the function that looks up the string in the locale table should go through texteq. However, this would mostly satisfy the purists but create a bizarre user experience. Looking through the patch quickly, if you want to get Unicode-fancy, doing a case-insensitive comparison by running lower-case on both strings is also wrong in corner cases. All the Greek month names end in sigma, so I suspect that this patch might not work correctly in such cases. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
Alvaro Herrera writes: > But that's a different POV. The input to this function could come from > arbitrary user input from any application whatsoever. So the only > reason we can get away with that is because the example regression case > Juan José added (which uses non-normals) does not conform to the > standard. I'm unsure about "conforming to standard", but I think it's reasonable to put the onus of doing normalization when necessary on the user. Otherwise, we need to move normalization logic into basically all the string processing functions (even texteq), which seems like a pretty huge cost that will benefit only a small minority of people. (If it's not a small minority, then where's the bug reports complaining that we don't do it today?) regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-Jan-24, Peter Eisentraut wrote: > When you interact with glibc locale functions, you essentially have to > assume that everything is in NFC. When using ICU locale functions (which we > don't here, but just for the sake of argument), then I would expect that ICU > does appropriate normalization itself when I call > icu_what_month_is_this(string) returns int. So I think it is appropriate to > not deal with normalization in this patch. But that's a different POV. The input to this function could come from arbitrary user input from any application whatsoever. So the only reason we can get away with that is because the example regression case Juan José added (which uses non-normals) does not conform to the standard. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-01-24 12:44, Alvaro Herrera wrote: On 2020-Jan-24, Juan José Santamaría Flecha wrote: There is an open patch that will make the normalization functionality user visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD TMMON ') I would vote to drop the normalization logic inside this patch altogether. I was reading the SQL standard on this point, and it says this (4.2.8 Universal character sets): An SQL-implementation may assume that all UCS strings are normalized in one of [Unicode normalization forms]. which seems to agree with what you're saying. When you interact with glibc locale functions, you essentially have to assume that everything is in NFC. When using ICU locale functions (which we don't here, but just for the sake of argument), then I would expect that ICU does appropriate normalization itself when I call icu_what_month_is_this(string) returns int. So I think it is appropriate to not deal with normalization in this patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= writes: > On Thu, Jan 23, 2020 at 11:00 PM Tom Lane wrote: >> * It's not exactly apparent to me why this code should be concerned >> about non-normalized characters when noplace else in the backend is. > There is an open patch that will make the normalization functionality user > visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD > TMMON ') I would vote to drop the normalization logic inside this patch > altogether. Works for me. > * I have no faith in this calculation that decides how long the match >> length was: >> *len = element_len + name_len - norm_len; > The proper logic would come from do_to_timestamp() receiving a normalized > "date_txt" input, so we do not operate with unnormalize and normalize > strings at the same time. No, that only solves half the problem, because the downcasing transformation can change the string length too. Two easy examples: * In German, I believe "ß" downcases to "ss". In Latin-1 encoding that's a length change, though I think it might accidentally not be in UTF8. * The Turks distinguish dotted and dotless "i", so that "İ" downcases to "i", and conversely "I" downcases to "ı". Those are length changes in UTF8, though not in whichever Latin-N encoding works for Turkish. Even if these cases happen not to apply to any month or day name of the relevant language, we still have a problem, arising from the fact that we're downcasing the whole remaining string --- so length changes after the match could occur anyway. > I would like to rise a couple of questions myself: > * When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’ > defined but not used". Should we drop this function or uncomment its usage? Maybe, but I don't think it belongs in this patch. > * Would it be worth moving str_tolower(localized_name) > from seq_search_localized() into cache_locale_time()? I think it'd complicate tracking when that cache has to be invalidated (i.e. it now depends on more than just LC_TIME). On the whole I wouldn't bother unless someone does the measurements to show there'd be a useful speedup. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-Jan-24, Juan José Santamaría Flecha wrote: > There is an open patch that will make the normalization functionality user > visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD > TMMON ') I would vote to drop the normalization logic inside this patch > altogether. I was reading the SQL standard on this point, and it says this (4.2.8 Universal character sets): An SQL-implementation may assume that all UCS strings are normalized in one of [Unicode normalization forms]. which seems to agree with what you're saying. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On Thu, Jan 23, 2020 at 11:00 PM Tom Lane wrote: Thank you for your time looking into this. Here's a v7 patch, rebased over my recent hacking, and addressing > most of the complaints I raised in <31691.1579648...@sss.pgh.pa.us>. > However, I've got some new complaints now that I've looked harder > at the code: > > * It's not exactly apparent to me why this code should be concerned > about non-normalized characters when noplace else in the backend is. > I think we should either rip that out entirely, or move the logic > into str_tolower (and hence also str_toupper etc). I'd tend to favor > the former, but of course I don't speak any languages where this would > be an issue. Perhaps a better idea is to invent a new SQL function > that normalizes a given string, and expect users to call that first > if they'd like to_date() to match unnormalized text. > > There is an open patch that will make the normalization functionality user visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD TMMON ') I would vote to drop the normalization logic inside this patch altogether. * I have no faith in this calculation that decides how long the match > length was: > > *len = element_len + name_len - norm_len; > > I seriously doubt that this does the right thing if either the > normalization or the downcasing changed the byte length of the > string. I'm not actually sure how we can do that correctly. > There's no good way to know whether the changes happened within > the part of the "name" string that we matched, or the part beyond > what we matched, but we only want to count the former. So this > seems like a pretty hard problem, and even if this logic is somehow > correct as it stands, it needs a comment explaining why. > > The proper logic would come from do_to_timestamp() receiving a normalized "date_txt" input, so we do not operate with unnormalize and normalize strings at the same time. > * I'm still concerned about the risk of integer overflow in the > string allocations in seq_search_localized(). Those need to be > adjusted to be more paranoid, similar to the code in e.g. str_tolower. > Please find attached a patch with the normalization logic removed, thus no direct allocations in seq_search_localized(). I would like to rise a couple of questions myself: * When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’ defined but not used". Should we drop this function or uncomment its usage? * Would it be worth moving str_tolower(localized_name) from seq_search_localized() into cache_locale_time()? [1] https://www.postgresql.org/message-id/014866c8-d7ff-2a4f-c185-cf7e3ceb7028%402ndquadrant.com Regards, Juan José Santamaría Flecha 0001-Allow-localized-month-names-to_date-v8.patch Description: Binary data
Re: Allow to_date() and to_timestamp() to accept localized names
Here's a v7 patch, rebased over my recent hacking, and addressing most of the complaints I raised in <31691.1579648...@sss.pgh.pa.us>. However, I've got some new complaints now that I've looked harder at the code: * It's not exactly apparent to me why this code should be concerned about non-normalized characters when noplace else in the backend is. I think we should either rip that out entirely, or move the logic into str_tolower (and hence also str_toupper etc). I'd tend to favor the former, but of course I don't speak any languages where this would be an issue. Perhaps a better idea is to invent a new SQL function that normalizes a given string, and expect users to call that first if they'd like to_date() to match unnormalized text. * I have no faith in this calculation that decides how long the match length was: *len = element_len + name_len - norm_len; I seriously doubt that this does the right thing if either the normalization or the downcasing changed the byte length of the string. I'm not actually sure how we can do that correctly. There's no good way to know whether the changes happened within the part of the "name" string that we matched, or the part beyond what we matched, but we only want to count the former. So this seems like a pretty hard problem, and even if this logic is somehow correct as it stands, it needs a comment explaining why. * I'm still concerned about the risk of integer overflow in the string allocations in seq_search_localized(). Those need to be adjusted to be more paranoid, similar to the code in e.g. str_tolower. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 6c4359d..ff44496 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5934,7 +5934,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM prefix -translation mode (print localized day and month names based on +translation mode (use localized day and month names based on ) TMMonth @@ -5966,8 +5966,13 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM does not include trailing blanks. + + + + + to_timestamp and to_date ignore - the TM modifier. + the case when receiving names as an input. diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index 792f9ca..2f39050 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -87,6 +87,7 @@ #include "catalog/pg_collation.h" #include "catalog/pg_type.h" +#include "common/unicode_norm.h" #include "mb/pg_wchar.h" #include "parser/scansup.h" #include "utils/builtins.h" @@ -1059,9 +1060,11 @@ static int from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error); -static int seq_search(const char *name, const char *const *array, int *len); +static int seq_search_ascii(const char *name, const char *const *array, int *len); +static int seq_search_localized(const char *name, char **array, int *len); static int from_char_seq_search(int *dest, const char **src, const char *const *array, + char **localized_array, FormatNode *node, bool *have_error); static void do_to_timestamp(text *date_txt, text *fmt, bool std, struct pg_tm *tm, fsec_t *fsec, int *fprec, @@ -2459,7 +2462,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er * suitable for comparisons to ASCII strings. */ static int -seq_search(const char *name, const char *const *array, int *len) +seq_search_ascii(const char *name, const char *const *array, int *len) { unsigned char firstc; const char *const *a; @@ -2505,8 +2508,92 @@ seq_search(const char *name, const char *const *array, int *len) } /* - * Perform a sequential search in 'array' for an entry matching the first - * character(s) of the 'src' string case-insensitively. + * Sequentially search an array of possibly non-English words for + * a case-insensitive match to the initial character(s) of "name". + * + * This has the same API as seq_search_ascii(), but we use a more general + * downcasing transformation to achieve case-insensitivity. + * + * The array is treated as const, but we don't declare it that way because + * the arrays exported by pg_locale.c aren't const. + */ +static int +seq_search_localized(const char *name, char **array, int *len) +{ + char **a; + char *lower_name; + char *norm_name; + int name_len; + int norm_len; + + *len = 0; + + /* empty string can't match anything */ + if (!*name) + return -1; + + name_len = strlen(name); + + /* Normalize name, if working in Unicode */ + if (GetDatabaseEncoding() == PG_UTF8) + { + pg_wchar *wchar_name;
Re: Allow to_date() and to_timestamp() to accept localized names
Alvaro Herrera writes: > Even whitespace is problematic for some languages, such as Afan, > mon "Qunxa Garablu";/ > "Naharsi Kudo";/ > "Ciggilta Kudo";/ > (etc) but I think whitespace-splitting is going to be more comprehensible > in the vast majority of cases, even if not perfect. Interesting. Given that to_date et al are often willing to ignore whitespace in input, I wonder whether we won't have some other issues with names like that --- not so much that basic cases wouldn't work, as that people might reasonably expect that, say, we'd be flexible about the amount of whitespace. Seems like a problem for another day though. In the meantime, I agree that "truncate at whitespace" is a better heuristic for the error messages than what we've got. I'll go make it so. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-Jan-23, Tom Lane wrote: > That particular case could be improved by stopping at a dash ... but > since this code is also used to match strings like "A.M.", we can't > just exclude punctuation in general. Breaking at whitespace seems > like a reasonable compromise. Yeah, and there are cases where dashes are used in names -- browsing through glibc for example I quickly found Akan, for which the month names are: mon "Sanda-ppn";/ "Kwakwar-gyefuo";/ "Ebw-benem";/ and so on. Even whitespace is problematic for some languages, such as Afan, mon "Qunxa Garablu";/ "Naharsi Kudo";/ "Ciggilta Kudo";/ (etc) but I think whitespace-splitting is going to be more comprehensible in the vast majority of cases, even if not perfect. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
Alvaro Herrera writes: > On 2020-Jan-22, Tom Lane wrote: >> Arthur Zakirov writes: >>> Shouldn't we just show all remaining string instead of truncating it? >> That would avoid a bunch of arbitrary decisions, for sure. >> Anybody have an objection? > I think it would be clearer to search for whitespace starting at the > start of the bogus token and stop there. It might not be perfect, > particularly if any language has whitespace in a month etc name (I don't > know any example but I guess it's not impossible for it to exist; > Portuguese weekday names maybe?), but it seems better than either of the > behaviors shown above. I'm okay with that. It won't work so well for cases like "1-Januzry-1999", but it's still better than what happens now: regression=# select to_date('1-Januzry-1999', 'DD MONTH '); ERROR: invalid value "Januzry-1" for "MONTH" That particular case could be improved by stopping at a dash ... but since this code is also used to match strings like "A.M.", we can't just exclude punctuation in general. Breaking at whitespace seems like a reasonable compromise. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020-Jan-22, Tom Lane wrote: > Arthur Zakirov writes: > > On 2020/01/23 7:11, Tom Lane wrote: > >> Closer examination shows that the "max" argument is pretty bogus as > >> well. It doesn't do anything except confuse the reader, because there > >> are no cases where the value passed is less than the maximum array entry > >> length, so it never acts to change seq_search's behavior. So we should > >> just drop that behavior from seq_search, too, and redefine "max" as > >> having no purpose except to specify how much of the string to show in > >> error messages. There's still a question of what that should be for > >> non-English cases, but at least we now have a clear idea of what we > >> need the value to do. > > > Shouldn't we just show all remaining string instead of truncating it? > > That would avoid a bunch of arbitrary decisions, for sure. > Anybody have an objection? I think it would be clearer to search for whitespace starting at the start of the bogus token and stop there. It might not be perfect, particularly if any language has whitespace in a month etc name (I don't know any example but I guess it's not impossible for it to exist; Portuguese weekday names maybe?), but it seems better than either of the behaviors shown above. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
Arthur Zakirov writes: > On 2020/01/23 7:11, Tom Lane wrote: >> Closer examination shows that the "max" argument is pretty bogus as >> well. It doesn't do anything except confuse the reader, because there >> are no cases where the value passed is less than the maximum array entry >> length, so it never acts to change seq_search's behavior. So we should >> just drop that behavior from seq_search, too, and redefine "max" as >> having no purpose except to specify how much of the string to show in >> error messages. There's still a question of what that should be for >> non-English cases, but at least we now have a clear idea of what we >> need the value to do. > Shouldn't we just show all remaining string instead of truncating it? That would avoid a bunch of arbitrary decisions, for sure. Anybody have an objection? regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020/01/23 7:11, Tom Lane wrote: Closer examination shows that the "max" argument is pretty bogus as well. It doesn't do anything except confuse the reader, because there are no cases where the value passed is less than the maximum array entry length, so it never acts to change seq_search's behavior. So we should just drop that behavior from seq_search, too, and redefine "max" as having no purpose except to specify how much of the string to show in error messages. There's still a question of what that should be for non-English cases, but at least we now have a clear idea of what we need the value to do. Shouldn't we just show all remaining string instead of truncating it? For example I get the following output: =# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH '); ERROR: invalid value "ЯНВА" for "MONTH" DETAIL: The given value did not match any of the allowed values for this field. This behavior is reproduced without the patch though (on Postgres 12). And the English case might confuse too I think: =# select to_date('3 JANUZRY 1999', 'DD MONTH '); ERROR: invalid value "JANUZRY 1" for "MONTH" DETAIL: The given value did not match any of the allowed values for this field. Users might think what means "1" in "JANUZRY 1" string. I attached the draft patch, which shows all remaining string. So the query above will show the following: =# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH '); ERROR: invalid value "ЯНВАРЯ 1999" for "MONTH" DETAIL: The remaining value did not match any of the allowed values for this field. The 0001 patch attached cleans up all the above complaints. I felt that given the business about scribbling on strings we shouldn't, it would also be wise to const-ify the string pointers if possible. That seems not too painful, per 0002 attached. +1 on the patch. -- Arthur diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index ef21b7e426..63e773c4c1 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -2539,17 +2539,11 @@ from_char_seq_search(int *dest, const char **src, const char *const *array, if (len <= 0) { - charcopy[DCH_MAX_ITEM_SIZ + 1]; - - /* Use multibyte-aware truncation to avoid generating a bogus string */ - max = pg_mbcliplen(*src, strlen(*src), max); - strlcpy(copy, *src, max + 1); - RETURN_ERROR(ereport(ERROR, (errcode(ERRCODE_INVALID_DATETIME_FORMAT), errmsg("invalid value \"%s\" for \"%s\"", -copy, node->key->name), - errdetail("The given value did not match any of " +*src, node->key->name), + errdetail("The remaining value did not match any of " "the allowed values for this field."; } *src += len;
Re: Allow to_date() and to_timestamp() to accept localized names
I wrote: > One thing I completely don't understand is why it's sufficient for > seq_search_localized to do "initcap" semantics. Shouldn't it cover > the all-upper and all-lower cases as well, as the existing seq_search > code does? (That is, why is it okay to ignore the "type" argument?) I took a closer look at this and decided you were probably doing that just because the seq_search code uses initcap-style case folding to match month and day names, relying on the assumption that the constant arrays it's passed are cased that way. But we shouldn't (and the patch doesn't) assume that the localized names we'll get from pg_locale.c are cased that way. However ... it seems to me that the way seq_search is defined is plenty bogus. In the first place, it scribbles on its source string, which is surely not okay. You can see that in error messages that are thrown on match failures: regression=# select to_date('3 JANUZRY 1999', 'DD MONTH '); ERROR: invalid value "JanuzRY 1" for "MONTH" DETAIL: The given value did not match any of the allowed values for this field. Now, maybe somebody out there thinks this is a cute way of highlighting how much of the string we examined, but it looks more like a bug from where I sit. It's mere luck that what's being clobbered is a local string copy created by do_to_timestamp(), and not the original input data passed to to_date(). In the second place, there's really no need at all for seq_search to rely on any assumptions about the case state of the array it's searching. pg_toupper() is pretty cheap already, and we can make it guaranteed cheap if we use pg_ascii_toupper() instead. So I think we ought to just remove the "type" argument altogether and have seq_search dynamically convert all the strings it examines to upper case (or lower case would work as well, at least for English). > On the other hand, you probably *should* be ignoring the "max" > argument, because AFAICS the values that are passed for that are > specific to the English ASCII spellings of the day and month names. > It seems very likely that they could be too small for some sets of > non-English names. Closer examination shows that the "max" argument is pretty bogus as well. It doesn't do anything except confuse the reader, because there are no cases where the value passed is less than the maximum array entry length, so it never acts to change seq_search's behavior. So we should just drop that behavior from seq_search, too, and redefine "max" as having no purpose except to specify how much of the string to show in error messages. There's still a question of what that should be for non-English cases, but at least we now have a clear idea of what we need the value to do. I also noted while I was looking at it that from_char_seq_search() would truncate at "max" bytes even when dealing with multibyte input. That has a clear risk of generating an invalidly-encoded error message. The 0001 patch attached cleans up all the above complaints. I felt that given the business about scribbling on strings we shouldn't, it would also be wise to const-ify the string pointers if possible. That seems not too painful, per 0002 attached. I'm tempted to go a bit further than 0001 does, and remove the 'max' argument from from_char_seq_search() altogether. Since we only need 'max' in error cases, which don't need to be super-quick, we could drop the requirement for the caller to specify that and instead compute it when we do need it as the largest of the constant array's string lengths. That'd carry over into the localized-names case in a reasonably straightforward way (though we might need to count characters not bytes for that to work nicely). Because of the risk of incorrectly-encoded error messages, I'm rather tempted to claim that these patches (or at least 0001) are a bug fix and should be back-patched. In any case I think we should apply this to HEAD and then rebase the localized-names patch over it. It makes a whole lot more sense, IMO, for the localized-names comparison logic to match what this is doing than what seq_search does today. Comments? regards, tom lane diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index ca3c48d..3ef10dc 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL}; * Flags & Options: * -- */ -#define ONE_UPPER 1 /* Name */ -#define ALL_UPPER 2 /* NAME */ -#define ALL_LOWER 3 /* name */ - #define MAX_MONTH_LEN 9 #define MAX_MON_LEN 3 #define MAX_DAY_LEN 9 @@ -1068,9 +1064,9 @@ static int from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error); -static int seq_search(char *name, const char *const *array, int type, int max,
Re: Allow to_date() and to_timestamp() to accept localized names
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= writes: > [ 0001-Allow-localized-month-names-to_date-v6.patch ] I took a quick look through this. One thing I completely don't understand is why it's sufficient for seq_search_localized to do "initcap" semantics. Shouldn't it cover the all-upper and all-lower cases as well, as the existing seq_search code does? (That is, why is it okay to ignore the "type" argument?) On the other hand, you probably *should* be ignoring the "max" argument, because AFAICS the values that are passed for that are specific to the English ASCII spellings of the day and month names. It seems very likely that they could be too small for some sets of non-English names. Related to that, the business with + mb_max = max * pg_encoding_max_length(encoding); + name_len = strlen(name); + name_len = name_len < mb_max ? name_len : mb_max; seems wrong even if we assume that "max" is a sane limit on the number of characters to consider. This coding is likely to truncate a long "name" string in the middle of a multibyte character, leading to bad-encoding errors from later functions. I also am confused by the "if (mb_max > max ...)" bit. It looks to me like that's an obscure way of writing "if (pg_encoding_max_length() > 1)", which is a rather pointless check considering that the if() then goes on to insist on encoding == PG_UTF8. A bit further down, you have an "if (name_wlen > norm_wlen)" condition that seems pretty fishy. Is it really guaranteed that unicode_normalize_kc cannot transform the string without shortening it? I don't see that specified in its header comment, for sure. Also, it's purely accidental if this: norm_name = (char *) palloc((norm_wlen + 1) * sizeof(pg_wchar)); allocates a long enough string for the conversion back to multibyte form, because that output is not pg_wchars. I think you want something more like "norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1". (I wonder whether we need to be worrying about integer overflow in any of this.) It seems like you could eliminate a lot of messiness by extending localized_abbrev_days[] and related arrays by one element that's always NULL. That would waste, hmm, four 8-byte pointers on typical machines --- but you're eating a lot more than 32 bytes of code to pass around the array lengths, plus it's really ugly that the plain and localized arrays are handled differently. I don't think the way you're managing the "localized_names" variable in DCH_from_char is very sensible. The reader has to do a lot of reverse engineering just to discover that the value is constant NULL in most references, something that you've not helped by declaring it outside the loop rather than inside. I think you could drop the variable and write constant NULL in most places, with something like S_TM(n->suffix) ? localized_full_days : NULL in those from_char_seq_search calls where it's relevant. In general I'm displeased with the lack of attention to comments. Notably, you added arguments to from_char_seq_search without updating its header comment ... not that that comment is a great API spec, but at the least you shouldn't be making it worse. I think the bug I complained of above is directly traceable to the lack of a clear spec here for what "max" means, so failure to keep these comments up to speed does have demonstrable bad consequences. regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
On Wed, Jan 15, 2020 at 11:20 AM Arthur Zakirov wrote: > > I have some couple picky notes. > > Thanks for the review. > > + if (name_len != norm_len) > > + pfree(norm_name); > > I'm not sure here. Is it save to assume that if it was allocated a new > norm_name name_len and norm_len always will differ? > Good call, that check can be more robust. > > > static const char *const months_full[] = { > > "January", "February", "March", "April", "May", "June", "July", > > - "August", "September", "October", "November", "December", NULL > > + "August", "September", "October", "November", "December" > > }; > > Unfortunately I didn't see from the patch why it was necessary to remove > last null element from months_full, days_short, adbc_strings, > adbc_strings_long and other arrays. I think it is not good because > unnecessarily increases the patch and adds code like the following: > > > + from_char_seq_search(&value, &s, months, > localized_names, > > + > ONE_UPPER, MAX_MON_LEN, n, have_error, > > + > lengthof(months_full)); > > Here it passes "months" from datetime.c to from_char_seq_search() and > calculates its length using "months_full" array from formatting.c. > With the introduction of seq_search_localized() that can be avoided, minimizing code churn. Please, find attached a version addressing the above mentioned. Regards, Juan José Santamaría Flecha > > 0001-Allow-localized-month-names-to_date-v6.patch Description: Binary data
Re: Allow to_date() and to_timestamp() to accept localized names
Hello! On 2020/01/13 21:04, Juan José Santamaría Flecha wrote: Please, find attached a version addressing the above mentioned. I have some couple picky notes. + if (name_len != norm_len) + pfree(norm_name); I'm not sure here. Is it save to assume that if it was allocated a new norm_name name_len and norm_len always will differ? static const char *const months_full[] = { "January", "February", "March", "April", "May", "June", "July", - "August", "September", "October", "November", "December", NULL + "August", "September", "October", "November", "December" }; Unfortunately I didn't see from the patch why it was necessary to remove last null element from months_full, days_short, adbc_strings, adbc_strings_long and other arrays. I think it is not good because unnecessarily increases the patch and adds code like the following: + from_char_seq_search(&value, &s, months, localized_names, + ONE_UPPER, MAX_MON_LEN, n, have_error, + lengthof(months_full)); Here it passes "months" from datetime.c to from_char_seq_search() and calculates its length using "months_full" array from formatting.c. -- Arthur
Re: Allow to_date() and to_timestamp() to accept localized names
On Sat, Jan 11, 2020 at 5:06 PM Tomas Vondra wrote: > > Thanks. I did a quick review of this patch, and I think it's almost RFC. > > Thanks for reviewing. > - In func.sgml, it seems we've lost this bit: > > > TM does not include trailing blanks. > to_timestamp and to_date > ignore > the TM modifier. > > >Does that mean the function no longer ignore the TM modifier? That >would be somewhat problematic (i.e. it might break some user code). >But looking at the code I don't see why the patch would have this >effect, so I suppose it's merely a doc bug. > > It is intentional. This patch uses the TM modifier to identify the usage of localized names as input for to_timestamp() and to_date(). > - I don't think we need to include examples how to_timestmap ignores >case, I'd say just stating the fact is clear enough. But if we want to >have examples, I think we should not inline in the para but use the >established pattern: > > > Some examples: > > ... > > > >which is used elsewhere in the func.sgml file. > I was trying to match the style surrounding the usage notes for date/time formatting [1]. Agreed that it is not worth an example on its own, so dropped. > > - In formatting.c the "common/unicode_norm.h" should be right after >includes from "catalog/" to follow the alphabetic order (unless >there's a reason why that does not work). > Fixed. > > - I rather dislike the "dim" parameter name, because I immediately think >"dimension" which makes little sense. I suggest renaming to "nitems" >or "nelements" or something like that. > Agreed, using "nelements" as a better style matchup. Please, find attached a version addressing the above mentioned. [1] https://www.postgresql.org/docs/current/functions-formatting.html Regards, Juan José Santamaría Flecha > 0001-Allow-localized-month-names-to_date-v5.patch Description: Binary data
Re: Allow to_date() and to_timestamp() to accept localized names
On Thu, Sep 26, 2019 at 08:36:25PM +0200, Juan José Santamaría Flecha wrote: On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera wrote: This patch no longer applies. Can you please rebase? Thank you for the notification. The patch rot after commit 1a950f3, a rebased version is attached. Thanks. I did a quick review of this patch, and I think it's almost RFC. I only found a couple of minor issue: - In func.sgml, it seems we've lost this bit: TM does not include trailing blanks. to_timestamp and to_date ignore the TM modifier. Does that mean the function no longer ignore the TM modifier? That would be somewhat problematic (i.e. it might break some user code). But looking at the code I don't see why the patch would have this effect, so I suppose it's merely a doc bug. - I don't think we need to include examples how to_timestmap ignores case, I'd say just stating the fact is clear enough. But if we want to have examples, I think we should not inline in the para but use the established pattern: Some examples: ... which is used elsewhere in the func.sgml file. - In formatting.c the "common/unicode_norm.h" should be right after includes from "catalog/" to follow the alphabetic order (unless there's a reason why that does not work). - I rather dislike the "dim" parameter name, because I immediately think "dimension" which makes little sense. I suggest renaming to "nitems" or "nelements" or something like that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera wrote: > > This patch no longer applies. Can you please rebase? Thank you for the notification. The patch rot after commit 1a950f3, a rebased version is attached. Regards, Juan José Santamaría Flecha diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 67f1a82..bf9055e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -6578,8 +6578,17 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM does not include trailing blanks. + + + + + to_timestamp and to_date ignore - the TM modifier. + the case when receiving names as an input. For example, either + to_timestamp('2000-JUN', '-MON') or + to_timestamp('2000-Jun', '-MON') or + to_timestamp('2000-JUN', '-mon') work and return + the same output diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index f7175df..d69d89a 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -97,6 +97,7 @@ #include "utils/memutils.h" #include "utils/numeric.h" #include "utils/pg_locale.h" +#include "common/unicode_norm.h" /* -- * Convenience macros for error handling @@ -220,11 +221,11 @@ typedef struct */ static const char *const months_full[] = { "January", "February", "March", "April", "May", "June", "July", - "August", "September", "October", "November", "December", NULL + "August", "September", "October", "November", "December" }; static const char *const days_short[] = { - "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", NULL + "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" }; /* -- @@ -256,8 +257,8 @@ static const char *const days_short[] = { * matches for BC have an odd index. So the boolean value for BC is given by * taking the array index of the match, modulo 2. */ -static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR, NULL}; -static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR, NULL}; +static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR}; +static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR}; /* -- * AM / PM @@ -283,8 +284,8 @@ static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_S * matches for PM have an odd index. So the boolean value for PM is given by * taking the array index of the match, modulo 2. */ -static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR, NULL}; -static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR, NULL}; +static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR}; +static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR}; /* -- * Months in roman-numeral @@ -293,10 +294,10 @@ static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_S * -- */ static const char *const rm_months_upper[] = -{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I", NULL}; +{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"}; static const char *const rm_months_lower[] = -{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i", NULL}; +{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i"}; /* -- * Roman numbers @@ -1068,10 +1069,11 @@ static int from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error); -static int seq_search(char *name, const char *const *array, int type, int max, int *len); +static int seq_search_sqlascii(char *name, const char *const *array, int type, int max, int *len, int dim); +static int seq_search_localized(char *name, char **array, int max, int *len, int dim); static int from_char_seq_search(int *dest, char **src, - const char *const *array, int type, int max, - FormatNode *node, bool *have_error); + const char *const *array, char **localized_array, int type, int max, + FormatNode *node, bool *have_error, int dim); static void do_to_timestamp(text *date_txt, text *fmt, bool std, struct pg_tm *tm, fsec_t *fsec, int *fprec, uint32 *flags, bool *have_error); @@ -2454,17 +2456,18 @@ from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error) } /* -- - * Sequential search with to upper/lower conversion + * Sequential search with to upper/lower conversion for SQL_ASCII array input * -- */ static int -seq_search(char *name, const char *const *array, int type, int max, int *len) +seq_search_sqlascii(char *name, const char *const *array, int type, int max, int *len, int dim) { const char *p; const char
Re: Allow to_date() and to_timestamp() to accept localized names
This patch no longer applies. Can you please rebase? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On Wed, Sep 18, 2019 at 11:09 AM Juan José Santamaría Flecha wrote: > > On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera > wrote: > > > Thanks for taking a look at this. > > > I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we > > just use lengthof() of the corresponding array? AFAICS it should work > > just as well. > > > > It was because of the length difference between ascii-name arrays, > which were all null-ended, and localized-name arrays. The attached > version uses lengthof(). > > > I wonder if the "compare first char" thing (seq_search_localized) really > > works when there are multibyte chars in the day/month names. I think > > the code compares just the first char ... but what if the original > > string uses those funny Unicode non-normalized letters and the locale > > translation uses normalized letters? My guess is that the first-char > > comparison will fail, but surely you'll want the name to match. > > (There's no month/day name in Spanish that doesn't start with an ASCII > > letter, but I bet there are some in other languages.) I think the > > localized comparison should avoid the first-char optimization, just > > compare the whole string all the time, and avoid possible weird issues. > > The localized search is reformulated in this version to deal with > multibyte normalization. A regression test for this issue is included. This version is updated to optimize the need for dynamic allocation. > Regards, > > Juan José Santamaría Flecha diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 4e3e213..c470c0e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -6415,8 +6415,17 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); TM does not include trailing blanks. + + + + + to_timestamp and to_date ignore - the TM modifier. + the case when receiving names as an input. For example, either + to_timestamp('2000-JUN', '-MON') or + to_timestamp('2000-Jun', '-MON') or + to_timestamp('2000-JUN', '-mon') work and return + the same output. diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index 755ca6e..39d2d11 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -96,6 +96,7 @@ #include "utils/memutils.h" #include "utils/numeric.h" #include "utils/pg_locale.h" +#include "common/unicode_norm.h" /* -- * Routines type @@ -174,18 +175,17 @@ typedef struct #define CLOCK_24_HOUR 0 #define CLOCK_12_HOUR 1 - /* -- * Full months * -- */ static const char *const months_full[] = { "January", "February", "March", "April", "May", "June", "July", - "August", "September", "October", "November", "December", NULL + "August", "September", "October", "November", "December" }; static const char *const days_short[] = { - "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", NULL + "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" }; /* -- @@ -217,8 +217,8 @@ static const char *const days_short[] = { * matches for BC have an odd index. So the boolean value for BC is given by * taking the array index of the match, modulo 2. */ -static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR, NULL}; -static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR, NULL}; +static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR}; +static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR}; /* -- * AM / PM @@ -244,8 +244,8 @@ static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_S * matches for PM have an odd index. So the boolean value for PM is given by * taking the array index of the match, modulo 2. */ -static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR, NULL}; -static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR, NULL}; +static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR}; +static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR}; /* -- * Months in roman-numeral @@ -254,10 +254,10 @@ static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_S * -- */ static const char *const rm_months_upper[] = -{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I", NULL}; +{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"}; static const char *const rm_months_lower[] = -{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i", NULL}; +{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i"}; /* -- * Roman numbers @@ -975,7 +975,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw, static void DCH_to_char(FormatNode *node, bool is_
Re: Allow to_date() and to_timestamp() to accept localized names
On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera wrote: > Thanks for taking a look at this. > I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we > just use lengthof() of the corresponding array? AFAICS it should work > just as well. > It was because of the length difference between ascii-name arrays, which were all null-ended, and localized-name arrays. The attached version uses lengthof(). > I wonder if the "compare first char" thing (seq_search_localized) really > works when there are multibyte chars in the day/month names. I think > the code compares just the first char ... but what if the original > string uses those funny Unicode non-normalized letters and the locale > translation uses normalized letters? My guess is that the first-char > comparison will fail, but surely you'll want the name to match. > (There's no month/day name in Spanish that doesn't start with an ASCII > letter, but I bet there are some in other languages.) I think the > localized comparison should avoid the first-char optimization, just > compare the whole string all the time, and avoid possible weird issues. The localized search is reformulated in this version to deal with multibyte normalization. A regression test for this issue is included. Regards, Juan José Santamaría Flecha 0001-Allow-localized-month-names-to_date-v2.patch Description: Binary data
Re: Allow to_date() and to_timestamp() to accept localized names
On 2019-Aug-22, Juan José Santamaría Flecha wrote: > On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha > wrote: > > > > Going through the current items in the wiki's todo list, I have been > > looking into: "Allow to_date () and to_timestamp () to accept > > localized month names". > > I have gone through a second take on this, trying to give it a better > shape but it would surely benefit from some review, so I will open an > item in the commitfest. I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we just use lengthof() of the corresponding array? AFAICS it should work just as well. I wonder if the "compare first char" thing (seq_search_localized) really works when there are multibyte chars in the day/month names. I think the code compares just the first char ... but what if the original string uses those funny Unicode non-normalized letters and the locale translation uses normalized letters? My guess is that the first-char comparison will fail, but surely you'll want the name to match. (There's no month/day name in Spanish that doesn't start with an ASCII letter, but I bet there are some in other languages.) I think the localized comparison should avoid the first-char optimization, just compare the whole string all the time, and avoid possible weird issues. But then, I'm not 100% sure of this code. formatting.c is madness distilled. Thanks, -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
On Thu, Aug 22, 2019 at 9:38 PM Juan José Santamaría Flecha wrote: > > > > > Going through the current items in the wiki's todo list, I have been > > looking into: "Allow to_date () and to_timestamp () to accept > > localized month names". > > > > I have gone through a second take on this, trying to give it a better > shape but it would surely benefit from some review, so I will open an > item in the commitfest. For reviewers, the current aproach for this patch is: Break seq_search() into two functions: * seq_search_sqlascii() that supports seq_search() current usage. * seq_search_localized() similar to the previous but supports multibyte input. To avoid code duplication most of current seq_search() logic has been moved to a new function str_compare(). from_char_seq_search() is now responsible to choose between seq_search_sqlascii() or seq_search_localized(). Also, since localized names is not a null terminated array, seq_search() now receives the dimension as input and terminating nulls have been removed from static arrays. The commitfest item is: https://commitfest.postgresql.org/24/2255/ Regards, Juan José Santamaría Flecha
Re: Allow to_date() and to_timestamp() to accept localized names
On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha wrote: > > Going through the current items in the wiki's todo list, I have been > looking into: "Allow to_date () and to_timestamp () to accept > localized month names". > I have gone through a second take on this, trying to give it a better shape but it would surely benefit from some review, so I will open an item in the commitfest. Regards, Juan José Santamaría Flecha 0001-Allow-localized-month-names-to_date-v1.patch Description: Binary data