Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread James Coleman
On Sun, Mar 8, 2020 at 6:03 PM Tom Lane  wrote:

> James Coleman  writes:
> > So just to confirm I understand, that implies that the issue is solely
> that
> > only the utf8 tr_TR set is installed by default on this machine, and the
> > iso-8859-9 set is a hard requirement (that is, the test is explicitly
> > testing a codepath that generates utf8 results from a non-utf8 source)?
>
> It's not "explicitly" testing that; the fact that "tr_TR" is treated
> as "tr_TR.iso88599" is surely an implementation artifact.  (On macOS,
> some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".)
> But yeah, I think it's intentional that we want the codeset translation
> path to be exercised here on at least some platforms.
>

I idly wonder if there could/should be some tests for the implicit case,
some explicitly testing the codeset translation (if possible), and some
testing the explicit utf8 case...but I don't know enough about this area to
push for anything.


> > If in fact Ubuntu doesn't install this locale by default, then is this a
> > caveat we should add to developer docs somewhere? It seems odd to me that
> > I'd be the only one encountering it, but OTOH I would have thought this a
> > fairly vanilla install too...
>
> Not sure.  The lack of prior complaints points to this not being a
> common situation.  It does seem weird that they'd set things up so
> that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an
> outright packaging mistake, it seems like a POLA violation from here.
>
> regards, tom lane
>

All right. I downloaded an Ubuntu 18.04.3 server VM from OSBoxes, and it
had very few locales installed by default...so that wasn't all that helpful.

I think at this point I'll just leave this as a mystery, much as I hate
that.

James


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread Tom Lane
James Coleman  writes:
> So just to confirm I understand, that implies that the issue is solely that
> only the utf8 tr_TR set is installed by default on this machine, and the
> iso-8859-9 set is a hard requirement (that is, the test is explicitly
> testing a codepath that generates utf8 results from a non-utf8 source)?

It's not "explicitly" testing that; the fact that "tr_TR" is treated
as "tr_TR.iso88599" is surely an implementation artifact.  (On macOS,
some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".)
But yeah, I think it's intentional that we want the codeset translation
path to be exercised here on at least some platforms.

> If in fact Ubuntu doesn't install this locale by default, then is this a
> caveat we should add to developer docs somewhere? It seems odd to me that
> I'd be the only one encountering it, but OTOH I would have thought this a
> fairly vanilla install too...

Not sure.  The lack of prior complaints points to this not being a
common situation.  It does seem weird that they'd set things up so
that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an
outright packaging mistake, it seems like a POLA violation from here.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread James Coleman
On Sun, Mar 8, 2020 at 2:19 PM Tom Lane  wrote:

> I wrote:
> > James Coleman  writes:
> >> I'm still interested in understanding why we're using the ISO locale
> >> instead of the utf8 one in a utf8-labeled test though.
>
> > We are not.  My understanding of the rules about this is that the
> > active LC_CTYPE setting determines the encoding that libc uses,
> > period.  The encoding suffix on the locale name only makes a
> > difference when LC_CTYPE is being specified (or derived from LANG or
> > LC_ALL), not any other LC_XXX setting --- although for consistency
> > they'll let you include it in any LC_XXX value.
>
> Oh wait --- I'm wrong about that.  Looking at the code in pg_locale.c,
> what actually happens is that we get data in the codeset implied by
> the LC_TIME setting and then translate it to the database encoding
> (cf commit 7ad1cd31b).  So if bare "tr_TR" is taken as implying
> iso-8859-9, which seems likely (it appears to work that way here,
> anyway) then this test is exercising the codeset translation path.
> We could change the test to say 'tr_TR.utf8' but that would give us
> less test coverage.
>

So just to confirm I understand, that implies that the issue is solely that
only the utf8 tr_TR set is installed by default on this machine, and the
iso-8859-9 set is a hard requirement (that is, the test is explicitly
testing a codepath that generates utf8 results from a non-utf8 source)?

If so, I'm going to try a bare Ubuntu install on a VM and see what locales
are installed by default for Turkish.

If in fact Ubuntu doesn't install this locale by default, then is this a
caveat we should add to developer docs somewhere? It seems odd to me that
I'd be the only one encountering it, but OTOH I would have thought this a
fairly vanilla install too...

James


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread Tom Lane
I wrote:
> James Coleman  writes:
>> I'm still interested in understanding why we're using the ISO locale
>> instead of the utf8 one in a utf8-labeled test though.

> We are not.  My understanding of the rules about this is that the
> active LC_CTYPE setting determines the encoding that libc uses,
> period.  The encoding suffix on the locale name only makes a
> difference when LC_CTYPE is being specified (or derived from LANG or
> LC_ALL), not any other LC_XXX setting --- although for consistency
> they'll let you include it in any LC_XXX value.

Oh wait --- I'm wrong about that.  Looking at the code in pg_locale.c,
what actually happens is that we get data in the codeset implied by
the LC_TIME setting and then translate it to the database encoding
(cf commit 7ad1cd31b).  So if bare "tr_TR" is taken as implying
iso-8859-9, which seems likely (it appears to work that way here,
anyway) then this test is exercising the codeset translation path.
We could change the test to say 'tr_TR.utf8' but that would give us
less test coverage.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread Tom Lane
James Coleman  writes:
> Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine,
> I tried selecting the tr_TR.ISO-8859-9 option there and removed the
> /var/lib/locales/supported.d/local file. Now I get:

> $ locale -a | grep tr_TR
> tr_TR
> tr_TR.iso88599
> tr_TR.utf8

> And now `make check` passes.

Hm.

> I'm still interested in understanding why we're using the ISO locale
> instead of the utf8 one in a utf8-labeled test though.

We are not.  My understanding of the rules about this is that the
active LC_CTYPE setting determines the encoding that libc uses,
period.  The encoding suffix on the locale name only makes a
difference when LC_CTYPE is being specified (or derived from LANG or
LC_ALL), not any other LC_XXX setting --- although for consistency
they'll let you include it in any LC_XXX value.

We could of course choose to spell LC_TIME as 'tr_TR.utf8' in this
test, but that would open up the question of whether that causes
problems on platforms where the canonical spelling of the encoding
suffix is different (eg "UTF-8").  I'm disinclined to take that risk
without positive proof that it helps on some platform.

My guess about what was actually happening on your machine is that
the underlying locale data only exists in iso-8859-9 form, and that
glibc normally translates that to utf8 on-the-fly when it's demanded
... but if the data isn't installed at all, nothing happens.

On my Red Hat platforms, this installation choice seems pretty
all-or-nothing, but it sounds like Debian lets you chop it up
more finely (and maybe slice off your finger while at it :-().

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread James Coleman
On Sun, Mar 8, 2020 at 10:42 AM James Coleman  wrote:

> On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha <
> juanjo.santama...@gmail.com> wrote:
>
>>
>>
>> On Sun, Mar 8, 2020 at 3:48 AM Tom Lane  wrote:
>>
>>> James Coleman  writes:
>>> > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane  wrote:
>>> >> Looks like you may not have Turkish locale installed?  Try
>>> >> locale -a | grep tr_TR
>>>
>>> > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I
>>> assume the
>>> > utf8 version is acceptable? Or is there a non-utf8 variant?
>>>
>>> Hmm ... I'm far from an expert on the packaging of locale data, but
>>> the simplest explanation I can think of is that the tr_TR locale exists
>>> to some extent on your machine but the LC_TIME component of that is
>>> missing.
>>>
>>
>>  AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not
>> the same as 'tr_TR.utf8'.
>>
>
> The test name implies it's about utf8, though, which makes me wonder if
> the test should be testing utf8 instead?
>
> That being said, a bit more googling based on your node about the proper
> ISO encoding turned up this page: https://unix.stackexchange.com/a/446762
>
> And I confirmed that the locale you mentioned is available:
> $ grep "tr_TR" /usr/share/i18n/SUPPORTED
> tr_TR.UTF-8 UTF-8
> tr_TR ISO-8859-9
>
> So I tried:
> echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root
> session
> sudo dpkg-reconfigure locales
>
> That didn't seem to fix it, though `locale -a` still only lists
> tr_TR.utf8, so I'm still at a loss, and also unclear why a test names utf8
> is actually relying on an ISO encoding.
>

Another update:

Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine,
I tried selecting the tr_TR.ISO-8859-9 option there and removed the
/var/lib/locales/supported.d/local file. Now I get:

$ locale -a | grep tr_TR
tr_TR
tr_TR.iso88599
tr_TR.utf8

And now `make check` passes.

I'm still interested in understanding why we're using the ISO locale
instead of the utf8 one in a utf8-labeled test though.

Thanks,
James


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread James Coleman
On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

>
>
> On Sun, Mar 8, 2020 at 3:48 AM Tom Lane  wrote:
>
>> James Coleman  writes:
>> > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane  wrote:
>> >> Looks like you may not have Turkish locale installed?  Try
>> >> locale -a | grep tr_TR
>>
>> > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume
>> the
>> > utf8 version is acceptable? Or is there a non-utf8 variant?
>>
>> Hmm ... I'm far from an expert on the packaging of locale data, but
>> the simplest explanation I can think of is that the tr_TR locale exists
>> to some extent on your machine but the LC_TIME component of that is
>> missing.
>>
>
>  AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not
> the same as 'tr_TR.utf8'.
>

The test name implies it's about utf8, though, which makes me wonder if the
test should be testing utf8 instead?

That being said, a bit more googling based on your node about the proper
ISO encoding turned up this page: https://unix.stackexchange.com/a/446762

And I confirmed that the locale you mentioned is available:
$ grep "tr_TR" /usr/share/i18n/SUPPORTED
tr_TR.UTF-8 UTF-8
tr_TR ISO-8859-9

So I tried:
echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root
session
sudo dpkg-reconfigure locales

That didn't seem to fix it, though `locale -a` still only lists tr_TR.utf8,
so I'm still at a loss, and also unclear why a test names utf8 is actually
relying on an ISO encoding.

James


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread James Coleman
On Sat, Mar 7, 2020 at 9:48 PM Tom Lane  wrote:

> James Coleman  writes:
> > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane  wrote:
> >> Looks like you may not have Turkish locale installed?  Try
> >> locale -a | grep tr_TR
>
> > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume
> the
> > utf8 version is acceptable? Or is there a non-utf8 variant?
>
> Hmm ... I'm far from an expert on the packaging of locale data, but
> the simplest explanation I can think of is that the tr_TR locale exists
> to some extent on your machine but the LC_TIME component of that is
> missing.
>
> Do you get different results from "date" depending on the locale?
> I get
>
> $ LANG=C date
> Sat Mar  7 21:44:24 EST 2020
> $ LANG=tr_TR.utf8 date
> Cts Mar  7 21:44:26 EST 2020
>

$ LANG=C date
Sun Mar  8 09:27:50 EDT 2020
$ LANG=tr_TR.utf8 date
Paz Mar  8 09:27:54 EDT 2020
$ LANG=tr_TR date
Sun Mar  8 09:27:57 EDT 2020

I'm curious if you get something different for that last one (no utf8
qualifier).

on my Fedora 30 box.
>
> Another possibility perhaps is that you have partial locale settings
> in your environment that are bollixing the test.  Try
>
> $ env | grep ^LANG
>

This gives me:
LANG=en_US.UTF-8
LANGUAGE=en_US


> $ env | grep ^LC_
>

Nothing for this.


> If there's more than one relevant environment setting, and they
> don't all agree, I'm not sure what would happen with our
> regression tests.
>

There's LANG and LANGUAGE but they look like they effectively agree to me?


> BTW, what platform are you using anyway?
>

ElementaryOS 5.1 Hera, which is built on top of Ubuntu 18.04.3 LTS (and
Linux 4.15.0-72-generic).

This suggests I have the standard Ubuntu Turkish language packages
installed:

$ dpkg -l | grep language-pack-tr
ii  language-pack-tr  1:18.04+20180712
   all  translation updates for language Turkish
ii  language-pack-tr-base 1:18.04+20180712
   all  translations for language Turkish

James


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-08 Thread Juan José Santamaría Flecha
On Sun, Mar 8, 2020 at 3:48 AM Tom Lane  wrote:

> James Coleman  writes:
> > On Sat, Mar 7, 2020 at 9:31 PM Tom Lane  wrote:
> >> Looks like you may not have Turkish locale installed?  Try
> >> locale -a | grep tr_TR
>
> > Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume
> the
> > utf8 version is acceptable? Or is there a non-utf8 variant?
>
> Hmm ... I'm far from an expert on the packaging of locale data, but
> the simplest explanation I can think of is that the tr_TR locale exists
> to some extent on your machine but the LC_TIME component of that is
> missing.
>

 AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not
the same as 'tr_TR.utf8'.


> BTW, what platform are you using anyway?
>

I have just checked in a Debian Stretch

Regards,

Juan José Santamaría Flecha


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-07 Thread Tom Lane
James Coleman  writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane  wrote:
>> Looks like you may not have Turkish locale installed?  Try
>> locale -a | grep tr_TR

> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?

Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.

Do you get different results from "date" depending on the locale?
I get

$ LANG=C date
Sat Mar  7 21:44:24 EST 2020
$ LANG=tr_TR.utf8 date
Cts Mar  7 21:44:26 EST 2020

on my Fedora 30 box.

Another possibility perhaps is that you have partial locale settings
in your environment that are bollixing the test.  Try

$ env | grep ^LANG
$ env | grep ^LC_

If there's more than one relevant environment setting, and they
don't all agree, I'm not sure what would happen with our
regression tests.

BTW, what platform are you using anyway?

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-07 Thread James Coleman
On Sat, Mar 7, 2020 at 9:31 PM Tom Lane  wrote:

> James Coleman  writes:
> > On master with a clean build (and configure re-run) and a fresh init-db,
> > I'm seeing the collate.linux.utf8 test fail with the attached diff.
>
>  -- to_char
>  SET lc_time TO 'tr_TR';
> +ERROR:  invalid value for parameter "lc_time": "tr_TR"
>  SELECT to_char(date '2010-02-01', 'DD TMMON ');
>
> Looks like you may not have Turkish locale installed?  Try
>
> locale -a | grep tr_TR
>
> If you don't see "tr_TR.utf8" or some variant spelling of that,
> the collate.linux.utf8 test is not gonna pass.  The required
> package is probably some sub-package of glibc.
>
> A workaround if you don't want to install more stuff is to run the
> regression tests in C locale, so that that test script gets skipped.
>
> regards, tom lane
>

Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
utf8 version is acceptable? Or is there a non-utf8 variant?

Thanks,
James


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-07 Thread Tom Lane
James Coleman  writes:
> On master with a clean build (and configure re-run) and a fresh init-db,
> I'm seeing the collate.linux.utf8 test fail with the attached diff.

 -- to_char
 SET lc_time TO 'tr_TR';
+ERROR:  invalid value for parameter "lc_time": "tr_TR"
 SELECT to_char(date '2010-02-01', 'DD TMMON ');

Looks like you may not have Turkish locale installed?  Try

locale -a | grep tr_TR

If you don't see "tr_TR.utf8" or some variant spelling of that,
the collate.linux.utf8 test is not gonna pass.  The required
package is probably some sub-package of glibc.

A workaround if you don't want to install more stuff is to run the
regression tests in C locale, so that that test script gets skipped.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-07 Thread James Coleman
On Tue, Mar 3, 2020 at 11:09 AM Tom Lane  wrote:

> =?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <
> juanjo.santama...@gmail.com> writes:
> > On Sun, Mar 1, 2020 at 11:25 PM Tom Lane  wrote:
> >> Going once, going twice ...
>
> > You have moved this to better place, so none from me, and thank you.
>
> Pushed then.
>

On master with a clean build (and configure re-run) and a fresh init-db,
I'm seeing the collate.linux.utf8 test fail with the attached diff.

Is there something I need to reconfigure, install on my machine (elementary
os, so based on Ubuntu) for this to pass?

Thanks,
James


regression.diffs
Description: Binary data


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-03 Thread Tom Lane
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?=  
writes:
> On Sun, Mar 1, 2020 at 11:25 PM Tom Lane  wrote:
>> Going once, going twice ...

> You have moved this to better place, so none from me, and thank you.

Pushed then.

While looking over the patch one more time, I noticed some pretty poor
English in docs and error messages for the related jsonb code, so
I took it on myself to fix that while at it.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-02 Thread Juan José Santamaría Flecha
On Sun, Mar 1, 2020 at 11:25 PM Tom Lane  wrote:

>
> > Barring objections, this seems
> > committable to me.
>
> Going once, going twice ...
>

You have moved this to better place, so none from me, and thank you.

Regards,

Juan José Santamaría Flecha


Re: Allow to_date() and to_timestamp() to accept localized names

2020-03-01 Thread Tom Lane
I wrote:
> * I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID
> as the collation to do case-folding with.  For to_date/to_timestamp,
> we have collatable text input so we can rely on the function's input
> collation instead, at the cost of having to pass down the collation
> OID through a few layers of subroutines :-(.  For parse_datetime,
> I punted for now and let it use DEFAULT_COLLATION_OID, because that's
> currently only called by JSONB code that probably hasn't got a usable
> input collation anyway (since jsonb isn't considered collatable).

On closer look, it's probably a wise idea to change the signature
of parse_datetime() to include a collation argument, because that
function is new in v13 so there's no API-break argument against it.
It will never be cheaper to change it than today.  So v11 below
does that, pushing the use of DEFAULT_COLLATION_OID into the
json-specific code.  Perhaps somebody else would like to look at
whether there's something brighter for that code to do, but I still
suspect there isn't, so I didn't chase it further.

> Barring objections, this seems
> committable to me.

Going once, going twice ...

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28035f1..8b73e05 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');


 TM prefix
-translation mode (print localized day and month names based on
+translation mode (use localized day and month names based on
  )
 TMMonth

@@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
 
  
   
-   TM does not include trailing blanks.
-   to_timestamp and to_date ignore
-   the TM modifier.
+   TM suppresses trailing blanks whether or
+   not FM is specified.
+  
+ 
+
+ 
+  
+   to_timestamp and to_date
+   ignore letter case in the input; so for
+   example MON, Mon,
+   and mon all accept the same strings.  When using
+   the TM modifier, case-folding is done according to
+   the rules of the function's input collation (see
+   ).
   
  
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index d029468..95f7d05 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 static void DCH_to_char(FormatNode *node, bool is_interval,
 		TmToChar *in, char *out, Oid collid);
 static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
-		  bool std, bool *have_error);
+		  Oid collid, bool std, bool *have_error);
 
 #ifdef DEBUG_TO_FROM_CHAR
 static void dump_index(const KeyWord *k, const int *index);
@@ -1057,11 +1057,14 @@ static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(const char *name, const char *const *array, int *len);
+static int	seq_search_ascii(const char *name, const char *const *array, int *len);
+static int	seq_search_localized(const char *name, char **array, int *len,
+ Oid collid);
 static int	from_char_seq_search(int *dest, const char **src,
  const char *const *array,
+ char **localized_array, Oid collid,
  FormatNode *node, bool *have_error);
-static void do_to_timestamp(text *date_txt, text *fmt, bool std,
+static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
 			struct pg_tm *tm, fsec_t *fsec, int *fprec,
 			uint32 *flags, bool *have_error);
 static char *fill_str(char *str, int c, int max);
@@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
 	unsigned char firstc;
 	const char *const *a;
@@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * case-folding transformation to achieve case-insensitivity.  Case folding
+ * is done per the rules of the collation identified by "collid".
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by p

Re: Allow to_date() and to_timestamp() to accept localized names

2020-02-29 Thread Tom Lane
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?=  
writes:
> On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>> So I'm withdrawing my concerns with respect to this patch.  As long as
>> it can do a roundtrip conversion with to_char(), it's fine.

> We can avoid issues with non injective case conversion languages with a
> double conversion, so both strings in the comparison end up in the same
> state.
> I propose an upper/lower conversion as in the attached patch.

This seems pretty reasonable to me, with a couple of caveats:

* I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID
as the collation to do case-folding with.  For to_date/to_timestamp,
we have collatable text input so we can rely on the function's input
collation instead, at the cost of having to pass down the collation
OID through a few layers of subroutines :-(.  For parse_datetime,
I punted for now and let it use DEFAULT_COLLATION_OID, because that's
currently only called by JSONB code that probably hasn't got a usable
input collation anyway (since jsonb isn't considered collatable).

* I think it's likely worthwhile to do a quick check for an exact
match before we go through all these case-folding pushups.  If the
expected use-case is reversing to_char() output, that will win all
the time.  We're probably ahead of the game even if it only matches
a few percent of the time.

Attached v10 incorporates those improvements, plus a bit of
polishing of docs and comments.  Barring objections, this seems
committable to me.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28035f1..8b73e05 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');


 TM prefix
-translation mode (print localized day and month names based on
+translation mode (use localized day and month names based on
  )
 TMMonth

@@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
 
  
   
-   TM does not include trailing blanks.
-   to_timestamp and to_date ignore
-   the TM modifier.
+   TM suppresses trailing blanks whether or
+   not FM is specified.
+  
+ 
+
+ 
+  
+   to_timestamp and to_date
+   ignore letter case in the input; so for
+   example MON, Mon,
+   and mon all accept the same strings.  When using
+   the TM modifier, case-folding is done according to
+   the rules of the function's input collation (see
+   ).
   
  
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index d029468..db99a6b 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 static void DCH_to_char(FormatNode *node, bool is_interval,
 		TmToChar *in, char *out, Oid collid);
 static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
-		  bool std, bool *have_error);
+		  Oid collid, bool std, bool *have_error);
 
 #ifdef DEBUG_TO_FROM_CHAR
 static void dump_index(const KeyWord *k, const int *index);
@@ -1057,11 +1057,14 @@ static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(const char *name, const char *const *array, int *len);
+static int	seq_search_ascii(const char *name, const char *const *array, int *len);
+static int	seq_search_localized(const char *name, char **array, int *len,
+ Oid collid);
 static int	from_char_seq_search(int *dest, const char **src,
  const char *const *array,
+ char **localized_array, Oid collid,
  FormatNode *node, bool *have_error);
-static void do_to_timestamp(text *date_txt, text *fmt, bool std,
+static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
 			struct pg_tm *tm, fsec_t *fsec, int *fprec,
 			uint32 *flags, bool *have_error);
 static char *fill_str(char *str, int c, int max);
@@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
 	unsigned char firstc;
 	const char *const *a;
@@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for

Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-31 Thread Juan José Santamaría Flecha
On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-01-28 16:47, Juan José Santamaría Flecha wrote:
> > This patch targets to do something symmetrical to to_char(), which will
> > just return a single value.
>
> I didn't fully realize while reading this thread that to_char() already
> supports localized output and this patch indeed just wants to do the
> opposite.
>
> So I'm withdrawing my concerns with respect to this patch.  As long as
> it can do a roundtrip conversion with to_char(), it's fine.
>
>
We can avoid issues with non injective case conversion languages with a
double conversion, so both strings in the comparison end up in the same
state.

I propose an upper/lower conversion as in the attached patch.

Regards,

Juan José Santamaría Flecha
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ceda48e..b1951e5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');


 TM prefix
-translation mode (print localized day and month names based on
+translation mode (use localized day and month names based on
  )
 TMMonth

@@ -6000,8 +6000,13 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
TM does not include trailing blanks.
+  
+ 
+
+ 
+  
to_timestamp and to_date ignore
-   the TM modifier.
+   the case when receiving names as an input.
   
  
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f58331d..e5b4eb5 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1059,9 +1059,11 @@ static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(const char *name, const char *const *array, int *len);
+static int	seq_search_ascii(const char *name, const char *const *array, int *len);
+static int	seq_search_localized(const char *name, char **array, int *len);
 static int	from_char_seq_search(int *dest, const char **src,
  const char *const *array,
+ char **localized_array,
  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
 			struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2459,7 +2461,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
 	unsigned char firstc;
 	const char *const *a;
@@ -2505,8 +2507,74 @@ seq_search(const char *name, const char *const *array, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * downcasing transformation to achieve case-insensitivity.
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len)
+{
+	char	  **a;
+	char	   *lower_name;
+	char	   *upper_name;
+
+	*len = 0;
+
+	/* empty string can't match anything */
+	if (!*name)
+		return -1;
+
+	/*
+	 * We do an upper/lower conversion to avoid problems with languages
+	 * in which case conversions are not injective.
+	 */
+	upper_name = str_toupper(unconstify(char *, name), strlen(name),
+			 DEFAULT_COLLATION_OID);
+	lower_name = str_tolower(upper_name, strlen(upper_name),
+			 DEFAULT_COLLATION_OID);
+	pfree(upper_name);
+
+	for (a = array; *a != NULL; a++)
+	{
+		char	   *lower_element;
+		char	   *upper_element;
+		int			element_len;
+
+		/* Upper/lower-case array element, assuming it is normalized */
+		upper_element = str_toupper(*a, strlen(*a), DEFAULT_COLLATION_OID);
+		lower_element = str_tolower(upper_element, strlen(upper_element),
+	DEFAULT_COLLATION_OID);
+		pfree(upper_element);
+		element_len = strlen(lower_element);
+
+		/* Match? */
+		if (strncmp(lower_name, lower_element, element_len) == 0)
+		{
+			*len = element_len;
+			pfree(lower_element);
+			pfree(lower_name);
+			return a - array;
+		}
+		pfree(lower_element);
+	}
+
+	pfree(lower_name);
+	return -1;
+}
+
+/*
+ * Perform a sequential search in 'array' (or 'localized_array', if that's
+ * not NULL) for an entry matching the first character(s) of the 'src'
+ * string case-insensitively.
+ *
+ 

Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Juan José Santamaría Flecha
On Tue, Jan 28, 2020 at 5:21 PM Alvaro Herrera 
wrote:

> On 2020-Jan-28, Peter Eisentraut wrote:
>
> > On 2020-01-28 04:05, Mark Dilger wrote:
> > > German uses both Sonnabend and Samstag for Saturday, so don’t you have
> to compare to a list of values anyway?
> >
> > Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag",
> then
> > it's not really usable.
>
> The string "Sonnabend" never appears in the glibc sources, so that will
> certainly not work.  I vote not to care about that, but of course my
> language is not one that has alternate weekday/month names.  I guess if
> we're intent on recognizing alternate names, we'll have to build our own
> list of them :-(
>
> I don't have the ICU sources here to check the status there.
>
>
"Sonnabend" is neither available in ICU.

What is available are both genitive and nominative forms for months, as
reported up thread by Peter. See formats "M" and "L" in:

http://userguide.icu-project.org/formatparse/datetime

Regards,

Juan José Santamaría Flecha


Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Peter Eisentraut

On 2020-01-28 16:47, Juan José Santamaría Flecha wrote:
This patch targets to do something symmetrical to to_char(), which will 
just return a single value.


I didn't fully realize while reading this thread that to_char() already 
supports localized output and this patch indeed just wants to do the 
opposite.


So I'm withdrawing my concerns with respect to this patch.  As long as 
it can do a roundtrip conversion with to_char(), it's fine.


It's pretty clear that this interface cannot be useful for producing or 
parsing fully general localized dates.  But since it exists now (and 
it's quasi SQL standard now), we might as well fill in this gap.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Tom Lane
Mark Dilger  writes:
>> On Jan 28, 2020, at 9:30 AM, Tom Lane  wrote:
>> A brute-force answer, if there are few enough cases, is to recognize
>> them regardless of the specific value of LC_TIME.  Do we think
>> anybody would notice or care if to_date('Sonnabend', 'TMDay') works
>> even when in a non-German locale?

> I think this only becomes a problem if there are weekday or month name 
> collisions between languages where they have different meanings.  As an 
> absurd hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in 
> English, then you’ve got a problem.

> This does happen for month abbreviations.  “Mar” is short for “March” and 
> variations in a number of languages, but short for “November” in Finnish.

> For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” 
> collides between sl_SL and no_NO.

But none of those cases are alternative names, so we wouldn't have
entries for them in this hypothetical list.  They'd only be recognized
when strftime() returns them.  I suspect also that the abbreviated
names have few if any alternatives, so we may only need this whole hack
for full names.

A possible way of tightening things up without explicitly decoding the
locale name is to make the entries of the list be string pairs: "if
strftime() returned this, then also consider that".  That puts a bit
of a premium on knowing which names strftime considers primary, but
I think we'd have had to know that anyway.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Mark Dilger



> On Jan 28, 2020, at 9:30 AM, Tom Lane  wrote:
> 
> A brute-force answer, if there are few enough cases, is to recognize
> them regardless of the specific value of LC_TIME.  Do we think
> anybody would notice or care if to_date('Sonnabend', 'TMDay') works
> even when in a non-German locale?

I think this only becomes a problem if there are weekday or month name 
collisions between languages where they have different meanings.  As an absurd 
hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in English, 
then you’ve got a problem.

This does happen for month abbreviations.  “Mar” is short for “March” and 
variations in a number of languages, but short for “November” in Finnish.

For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” 
collides between sl_SL and no_NO.

I don’t see any collisions with full month or full weekday names, but I’m also 
only looking at the 53 locales installed in /usr/share/locale/.*/LC_TIME on my 
mac.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Tom Lane
Mark Dilger  writes:
> But then the manual page goes on to say:

>> %E* %O*
>> POSIX locale extensions.  The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH 
>> %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide alternate 
>> representations.
>> 
>> Additionally %OB implemented to represent alternative months names (used 
>> standalone, without day mentioned).

> This is the part I haven’t played with, but it sounds like it can handle at 
> least one alternate name.  Perhaps you can get the alternates this way?

This sounded promising, but the POSIX strftime spec doesn't mention %OB,
so I'm afraid we can't count on it to do much.  At this point I'm not
really convinced that there are no languages with more than two forms,
anyway :-(.

I also wondered whether we could get any further by using strptime() to
convert localized month and day names on-the-fly, rather than the patch's
current approach of re-using strftime() results.  If strptime() fails
to support alternative names, it's their bug not ours.  Unfortunately,
glibc has got said bug (AFAICS anyway), so in practice this would only
offer us plausible deniability and not much of any real functionality.

In the end it seems like we could only handle alternative names by
keeping our own lists of them.  There are probably few enough cases
that that wouldn't be a tremendous maintenance problem, but what
I'm not quite seeing is how we'd decide which list to use when.
Right now, locale identifiers are pretty much opaque to us ... do
we really want to get into the business of recognizing that such a
name refers to German, or Greek?

A brute-force answer, if there are few enough cases, is to recognize
them regardless of the specific value of LC_TIME.  Do we think
anybody would notice or care if to_date('Sonnabend', 'TMDay') works
even when in a non-German locale?

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Alvaro Herrera
On 2020-Jan-28, Peter Eisentraut wrote:

> On 2020-01-28 04:05, Mark Dilger wrote:
> > German uses both Sonnabend and Samstag for Saturday, so don’t you have to 
> > compare to a list of values anyway?
> 
> Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag", then
> it's not really usable.

The string "Sonnabend" never appears in the glibc sources, so that will
certainly not work.  I vote not to care about that, but of course my
language is not one that has alternate weekday/month names.  I guess if
we're intent on recognizing alternate names, we'll have to build our own
list of them :-(

I don't have the ICU sources here to check the status there.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Mark Dilger



> On Jan 28, 2020, at 7:47 AM, Juan José Santamaría Flecha 
>  wrote:
> 
> This looks like a POSIX feature that some systems might not like (Windows 
> [1]). But if this is something that the patch should aim to, I am fine with a 
> RWF and give it another try in the future.

As long as this implementation doesn’t create a backward-compatibility problem 
when doing a more complete implementation later, I’m fine with this patch not 
tackling the whole problem.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Juan José Santamaría Flecha
On Tue, Jan 28, 2020 at 4:06 PM Mark Dilger 
wrote:

>
> I’m not insisting, just asking about the design.  If it only works with
> one name supported per weekday per language, and per month per language,
> I’m not going to object.  I was just curious if you were going to support
> multiple names anyway, and if that made the question about the Greek case
> folding less pressing.
>
>
This patch targets to do something symmetrical to to_char(), which will
just return a single value.

The issue with the greek locale is that it cannot hold this equivalent
behaviour, as in:

postgres=# select to_date(to_char(now(), 'TMMONTH'), 'TMMONTH');
ERROR:  invalid value "ΙΑΝΟΥΆΡΙΟΣ" for "MONTH"

Because of the particular behaviour sigma (Σσς) casing has, which is also
user visible with a lower().


>
> >  %E* %O*
> >POSIX locale extensions.  The sequences %Ec %EC %Ex %EX %Ey
> %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to
> provide alternate representations.
> >
> >Additionally %OB implemented to represent alternative months
> names (used standalone, without day mentioned).
>
> This is the part I haven’t played with, but it sounds like it can handle
> at least one alternate name.  Perhaps you can get the alternates this way?
>
>
This looks like a POSIX feature that some systems might not like (Windows
[1]). But if this is something that the patch should aim to, I am fine with
a RWF and give it another try in the future.

[1]
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strftime-wcsftime-strftime-l-wcsftime-l?view=vs-2019

Regards,

Juan José Santamaría Flecha


Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Mark Dilger



> On Jan 28, 2020, at 6:51 AM, Tom Lane  wrote:
> 
> Peter Eisentraut  writes:
>> On 2020-01-28 04:05, Mark Dilger wrote:
>>> German uses both Sonnabend and Samstag for Saturday, so don’t you have to 
>>> compare to a list of values anyway?
> 
>> Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag", 
>> then it's not really usable.
> 
> If we're going to insist on that, then the entire patch is junk.

I’m not insisting, just asking about the design.  If it only works with one 
name supported per weekday per language, and per month per language, I’m not 
going to object.  I was just curious if you were going to support multiple 
names anyway, and if that made the question about the Greek case folding less 
pressing.

> Indeed, I don't even know where we could get the knowledge of which
> name(s) to accept, because strftime is surely only going to tell us
> one of them.

I haven’t played with this yet, but per the manual page for strftime_l:

>  %Ais replaced by national representation of the full weekday name.
> 
>  %ais replaced by national representation of the abbreviated weekday 
> name.
> 
>  %Bis replaced by national representation of the full month name.
> 
>  %bis replaced by national representation of the abbreviated month 
> name.

Which I think gives you just the preferred name, by whatever means the library 
decides which of Sonnabend/Samstag is the preferred name.  But then the manual 
page goes on to say:

>  %E* %O*
>POSIX locale extensions.  The sequences %Ec %EC %Ex %EX %Ey %EY 
> %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide 
> alternate representations.
> 
>Additionally %OB implemented to represent alternative months names 
> (used standalone, without day mentioned).

This is the part I haven’t played with, but it sounds like it can handle at 
least one alternate name.  Perhaps you can get the alternates this way?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-01-28 04:05, Mark Dilger wrote:
>> German uses both Sonnabend and Samstag for Saturday, so don’t you have to 
>> compare to a list of values anyway?

> Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag", 
> then it's not really usable.

If we're going to insist on that, then the entire patch is junk.
Indeed, I don't even know where we could get the knowledge of which
name(s) to accept, because strftime is surely only going to tell us
one of them.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Peter Eisentraut

On 2020-01-28 03:11, Tom Lane wrote:

The other question your example raises is whether we should be trying
to de-accent before comparison, ie was it right for 'Ιανουάριος' to
be treated differently from 'Ιανουαριος'.  I don't know enough Greek
to say, but it kind of feels like that should be outside to_date's
purview.


To clarify, nothing in my examples was meant to imply that de-accenting 
might be necessary.  AFAICT, the accented forms are the only 
linguistically acceptable forms and all the locale libraries accept them 
correctly in their accented forms.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-28 Thread Peter Eisentraut

On 2020-01-28 04:05, Mark Dilger wrote:

German uses both Sonnabend and Samstag for Saturday, so don’t you have to 
compare to a list of values anyway?


Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag", 
then it's not really usable.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-27 Thread Mark Dilger



> On Jan 27, 2020, at 6:11 PM, Tom Lane  wrote:
> 
> and I think it
> makes German better (does 'ß' appear in any month/day names there?),
> so maybe we should just roll with that. 

To my knowledge, neither /ß/ nor /ss/ occur in day or month names in German, 
neither presently nor in recent times, so I wouldn’t expect it to matter for 
German whether you use uppercase or lowercase.

> 
> The other question your example raises is whether we should be trying
> to de-accent before comparison, ie was it right for 'Ιανουάριος' to
> be treated differently from 'Ιανουαριος'.  I don't know enough Greek
> to say, but it kind of feels like that should be outside to_date's
> purview.

I’m getting a little lost here.  German uses both Sonnabend and Samstag for 
Saturday, so don’t you have to compare to a list of values anyway?  I don’t 
know Norwegian, but Wikipedia shows both sundag and søndag for Sunday in 
Norwegian Nynorsk.   Faroese seems to have a few similar cases.  Whether those 
languages have alternate day names both in common usage, I can’t say, but both 
Sonnabend and Samstag are still in use in the German speaking world.  If you 
need to compare against lists, then I would think you could put both ιανουάριοσ 
and ιανουάριος into a list, even if only one of them is grammatically correct 
Greek.  I’d think that unaccented versions of names might also go into the 
list, depending on whether speakers of that language consider them equivalent.  
That sounds like something for the translators to hash out.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-27 Thread Tom Lane
Peter Eisentraut  writes:
> For the record, the correct form of that would appear to be
> select to_date('Ιανουάριος', 'TMMonth');
> with the accent.  I had tried different variations of that and they all 
> failed.

OK, so for anyone who is as confused as I was, the main point here
seems to be this: the upper case form of Greek sigma is 'Σ',
and the lower case form is 'σ' ... except as the final letter of
a word, where it is supposed to be written like 'ς'.

If I set lc_collate, lc_ctype, and lc_time to 'el_GR.utf8',
then (on a somewhat hoary glibc platform) I get

u8=# select to_char('2020-01-01'::timestamptz, 'TMMONTH'); 
   to_char
--
 ΙΑΝΟΥΆΡΙΟΣ
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMMonth');
   to_char
--
 Ιανουάριος
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMmonth');
   to_char
--
 ιανουάριος
(1 row)

which is correct AFAICS ... but

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH'));
lower 
--
 ιανουάριοσ
(1 row)

So what we actually have here, ISTM, is a bug in lower() not to_char().
The bug is unsurprising because str_tolower() simply applies towlower_l()
to each character independently, so there's no way for it to account for
the word-final rule.  I'm not aware that glibc provides any API whereby
that could be done correctly.  On the other hand, we get it right when
using an ICU collation for lower():

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH') collate 
"el-gr-x-icu");
lower 
--
 ιανουάριος
(1 row)

because that code path passes the whole string to ICU at once, and
of course getting this right is ICU's entire job.

I haven't double-checked, but I imagine that the reason that to_char
gets the month name case-folding right is that what comes out of
strftime(..."%B"...) is "Ιανουάριος" which we are able to upcase
correctly, while the downcasing code paths don't affect 'ς'.

I thought for a little bit about trying to dodge this issue in the
patch by folding to upper case, not lower, before comparing month/day
names.  I fear that that would just shift the problem cases to some
other language(s).  However, it makes Greek better, and I think it
makes German better (does 'ß' appear in any month/day names there?),
so maybe we should just roll with that.  In the end, it doesn't seem
right to reject this patch just because lower() is broken on some
platforms.

The other question your example raises is whether we should be trying
to de-accent before comparison, ie was it right for 'Ιανουάριος' to
be treated differently from 'Ιανουαριος'.  I don't know enough Greek
to say, but it kind of feels like that should be outside to_date's
purview.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-25 Thread Peter Eisentraut

On 2020-01-24 19:01, Peter Eisentraut wrote:

postgres=# select to_char(now(),'TMmonth');
to_char

   ιανουαρίου
(1 row)

which is the genitive of ιανουάριος.  You use the genitive form for a
date (24th of January) but the nominative otherwise.  But the reverse
mapping can only take one of these forms.  So here

select to_date('Ιανουαριος', 'TMMonth');

fails, which is bad.


For the record, the correct form of that would appear to be

select to_date('Ιανουάριος', 'TMMonth');

with the accent.  I had tried different variations of that and they all 
failed.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Peter Eisentraut

On 2020-01-24 18:25, Juan José Santamaría Flecha wrote:

To illustrate the issue, it does not work as expected:

postgres=# select lower(to_char(now(),'TMMONTH'));
    lower

  ιανουάριοσ
(1 row)

postgres=# select to_char(now(),'TMmonth');
   to_char

  ιανουάριος
(1 row)


Well, this is interesting, because on macOS and Debian stable I get

postgres=# select to_char(now(),'TMmonth');
  to_char

 ιανουαρίου
(1 row)

which is the genitive of ιανουάριος.  You use the genitive form for a 
date (24th of January) but the nominative otherwise.  But the reverse 
mapping can only take one of these forms.  So here


select to_date('Ιανουαριος', 'TMMonth');

fails, which is bad.

In the glibc locale data sources they have both forms listed, but 
apparently the API were are using only accepts one of them.


(I don't know any Greek, I'm just extrapolating from Wikipedia and 
locale data.)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Juan José Santamaría Flecha
On Fri, Jan 24, 2020 at 5:46 PM Tom Lane  wrote:

> Peter Eisentraut  writes:
> > Looking through the patch quickly, if you want to get Unicode-fancy,
> > doing a case-insensitive comparison by running lower-case on both
> > strings is also wrong in corner cases.  All the Greek month names end in
> > sigma, so I suspect that this patch might not work correctly in such
> cases.
>
> Hm.  That's basically what citext does, and I don't recall hearing
> complaints about that.  What other definition of "case insensitive"
> would you suggest?
>
>
To illustrate the issue, it does not work as expected:

postgres=# select lower(to_char(now(),'TMMONTH'));
   lower

 ιανουάριοσ
(1 row)

postgres=# select to_char(now(),'TMmonth');
  to_char

 ιανουάριος
(1 row)

Regards,

Juan José Santamaría Flecha


Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Tom Lane
Peter Eisentraut  writes:
> Looking through the patch quickly, if you want to get Unicode-fancy, 
> doing a case-insensitive comparison by running lower-case on both 
> strings is also wrong in corner cases.  All the Greek month names end in 
> sigma, so I suspect that this patch might not work correctly in such cases.

Hm.  That's basically what citext does, and I don't recall hearing
complaints about that.  What other definition of "case insensitive"
would you suggest?

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Peter Eisentraut

On 2020-01-24 17:22, Tom Lane wrote:

Alvaro Herrera  writes:

But that's a different POV.  The input to this function could come from
arbitrary user input from any application whatsoever.  So the only
reason we can get away with that is because the example regression case
Juan José added (which uses non-normals) does not conform to the
standard.


I'm unsure about "conforming to standard", but I think it's reasonable
to put the onus of doing normalization when necessary on the user.
Otherwise, we need to move normalization logic into basically all
the string processing functions (even texteq), which seems like a
pretty huge cost that will benefit only a small minority of people.
(If it's not a small minority, then where's the bug reports complaining
that we don't do it today?)


These reports do exist, and this behavior is known.  However, the impact 
is mostly that results "look wrong" (looks the same but doesn't compare 
as equal) rather than causing inconsistency and corruption, so it's 
mostly shrugged off.  The nondeterministic collation feature was 
introduced in part to be able to deal with this; the pending 
normalization patch is another.  However, this behavior is baked deeply 
into Unicode, so no single feature or facility will simply make it go away.


AFAICT, we haven't so far had any code that does a lookup of non-ASCII 
strings in a table, so that's why we haven't had this discussion yet.


Now that I think about it, you could also make an argument that this 
should be handled through collation, so the function that looks up the 
string in the locale table should go through texteq.  However, this 
would mostly satisfy the purists but create a bizarre user experience.


Looking through the patch quickly, if you want to get Unicode-fancy, 
doing a case-insensitive comparison by running lower-case on both 
strings is also wrong in corner cases.  All the Greek month names end in 
sigma, so I suspect that this patch might not work correctly in such cases.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Tom Lane
Alvaro Herrera  writes:
> But that's a different POV.  The input to this function could come from
> arbitrary user input from any application whatsoever.  So the only
> reason we can get away with that is because the example regression case
> Juan José added (which uses non-normals) does not conform to the
> standard.

I'm unsure about "conforming to standard", but I think it's reasonable
to put the onus of doing normalization when necessary on the user.
Otherwise, we need to move normalization logic into basically all
the string processing functions (even texteq), which seems like a
pretty huge cost that will benefit only a small minority of people.
(If it's not a small minority, then where's the bug reports complaining
that we don't do it today?)

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Alvaro Herrera
On 2020-Jan-24, Peter Eisentraut wrote:

> When you interact with glibc locale functions, you essentially have to
> assume that everything is in NFC.  When using ICU locale functions (which we
> don't here, but just for the sake of argument), then I would expect that ICU
> does appropriate normalization itself when I call
> icu_what_month_is_this(string) returns int.  So I think it is appropriate to
> not deal with normalization in this patch.

But that's a different POV.  The input to this function could come from
arbitrary user input from any application whatsoever.  So the only
reason we can get away with that is because the example regression case
Juan José added (which uses non-normals) does not conform to the
standard.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Peter Eisentraut

On 2020-01-24 12:44, Alvaro Herrera wrote:

On 2020-Jan-24, Juan José Santamaría Flecha wrote:


There is an open patch that will make the normalization functionality user
visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
TMMON ') I would vote to drop the normalization logic inside this patch
altogether.


I was reading the SQL standard on this point, and it says this (4.2.8
Universal character sets):

   An SQL-implementation may assume that all UCS strings are normalized
   in one of [Unicode normalization forms].

which seems to agree with what you're saying.


When you interact with glibc locale functions, you essentially have to 
assume that everything is in NFC.  When using ICU locale functions 
(which we don't here, but just for the sake of argument), then I would 
expect that ICU does appropriate normalization itself when I call 
icu_what_month_is_this(string) returns int.  So I think it is 
appropriate to not deal with normalization in this patch.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Tom Lane
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?=  
writes:
> On Thu, Jan 23, 2020 at 11:00 PM Tom Lane  wrote:
>> * It's not exactly apparent to me why this code should be concerned
>> about non-normalized characters when noplace else in the backend is.

> There is an open patch that will make the normalization functionality user
> visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
> TMMON ') I would vote to drop the normalization logic inside this patch
> altogether.

Works for me.

> * I have no faith in this calculation that decides how long the match
>> length was:
>>  *len = element_len + name_len - norm_len;

> The proper logic would come from do_to_timestamp() receiving a normalized
> "date_txt" input, so we do not operate with unnormalize and normalize
> strings at the same time.

No, that only solves half the problem, because the downcasing
transformation can change the string length too.  Two easy examples:

* In German, I believe "ß" downcases to "ss".  In Latin-1 encoding that's
a length change, though I think it might accidentally not be in UTF8.

* The Turks distinguish dotted and dotless "i", so that "İ" downcases to
"i", and conversely "I" downcases to "ı".  Those are length changes in
UTF8, though not in whichever Latin-N encoding works for Turkish.

Even if these cases happen not to apply to any month or day name of
the relevant language, we still have a problem, arising from the fact
that we're downcasing the whole remaining string --- so length changes
after the match could occur anyway.

> I would like to rise a couple of questions myself:

> * When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’
> defined but not used". Should we drop this function or uncomment its usage?

Maybe, but I don't think it belongs in this patch.

> * Would it be worth moving str_tolower(localized_name)
> from seq_search_localized() into cache_locale_time()?

I think it'd complicate tracking when that cache has to be invalidated
(i.e. it now depends on more than just LC_TIME).  On the whole I wouldn't
bother unless someone does the measurements to show there'd be a useful
speedup.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Alvaro Herrera
On 2020-Jan-24, Juan José Santamaría Flecha wrote:

> There is an open patch that will make the normalization functionality user
> visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
> TMMON ') I would vote to drop the normalization logic inside this patch
> altogether.

I was reading the SQL standard on this point, and it says this (4.2.8
Universal character sets):

  An SQL-implementation may assume that all UCS strings are normalized
  in one of [Unicode normalization forms].

which seems to agree with what you're saying.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-24 Thread Juan José Santamaría Flecha
On Thu, Jan 23, 2020 at 11:00 PM Tom Lane  wrote:

Thank you for your time looking into this.

Here's a v7 patch, rebased over my recent hacking, and addressing
> most of the complaints I raised in <31691.1579648...@sss.pgh.pa.us>.
> However, I've got some new complaints now that I've looked harder
> at the code:
>
> * It's not exactly apparent to me why this code should be concerned
> about non-normalized characters when noplace else in the backend is.
> I think we should either rip that out entirely, or move the logic
> into str_tolower (and hence also str_toupper etc).  I'd tend to favor
> the former, but of course I don't speak any languages where this would
> be an issue.  Perhaps a better idea is to invent a new SQL function
> that normalizes a given string, and expect users to call that first
> if they'd like to_date() to match unnormalized text.
>
>
There is an open patch that will make the normalization functionality user
visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
TMMON ') I would vote to drop the normalization logic inside this patch
altogether.

* I have no faith in this calculation that decides how long the match
> length was:
>
> *len = element_len + name_len - norm_len;
>
> I seriously doubt that this does the right thing if either the
> normalization or the downcasing changed the byte length of the
> string.  I'm not actually sure how we can do that correctly.
> There's no good way to know whether the changes happened within
> the part of the "name" string that we matched, or the part beyond
> what we matched, but we only want to count the former.  So this
> seems like a pretty hard problem, and even if this logic is somehow
> correct as it stands, it needs a comment explaining why.
>
>
The proper logic would come from do_to_timestamp() receiving a normalized
"date_txt" input, so we do not operate with unnormalize and normalize
strings at the same time.


> * I'm still concerned about the risk of integer overflow in the
> string allocations in seq_search_localized().  Those need to be
> adjusted to be more paranoid, similar to the code in e.g. str_tolower.
>

Please find attached a patch with the normalization logic removed, thus no
direct allocations in seq_search_localized().

I would like to rise a couple of questions myself:

* When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’
defined but not used". Should we drop this function or uncomment its usage?

* Would it be worth moving str_tolower(localized_name)
from seq_search_localized() into cache_locale_time()?

[1]
https://www.postgresql.org/message-id/014866c8-d7ff-2a4f-c185-cf7e3ceb7028%402ndquadrant.com

Regards,

Juan José Santamaría Flecha


0001-Allow-localized-month-names-to_date-v8.patch
Description: Binary data


Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-23 Thread Tom Lane
Here's a v7 patch, rebased over my recent hacking, and addressing
most of the complaints I raised in <31691.1579648...@sss.pgh.pa.us>.
However, I've got some new complaints now that I've looked harder
at the code:

* It's not exactly apparent to me why this code should be concerned
about non-normalized characters when noplace else in the backend is.
I think we should either rip that out entirely, or move the logic
into str_tolower (and hence also str_toupper etc).  I'd tend to favor
the former, but of course I don't speak any languages where this would
be an issue.  Perhaps a better idea is to invent a new SQL function
that normalizes a given string, and expect users to call that first
if they'd like to_date() to match unnormalized text.

* I have no faith in this calculation that decides how long the match
length was:

*len = element_len + name_len - norm_len;

I seriously doubt that this does the right thing if either the
normalization or the downcasing changed the byte length of the
string.  I'm not actually sure how we can do that correctly.
There's no good way to know whether the changes happened within
the part of the "name" string that we matched, or the part beyond
what we matched, but we only want to count the former.  So this
seems like a pretty hard problem, and even if this logic is somehow
correct as it stands, it needs a comment explaining why.

* I'm still concerned about the risk of integer overflow in the
string allocations in seq_search_localized().  Those need to be
adjusted to be more paranoid, similar to the code in e.g. str_tolower.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6c4359d..ff44496 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5934,7 +5934,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');


 TM prefix
-translation mode (print localized day and month names based on
+translation mode (use localized day and month names based on
  )
 TMMonth

@@ -5966,8 +5966,13 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
TM does not include trailing blanks.
+  
+ 
+
+ 
+  
to_timestamp and to_date ignore
-   the TM modifier.
+   the case when receiving names as an input.
   
  
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 792f9ca..2f39050 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -87,6 +87,7 @@
 
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
+#include "common/unicode_norm.h"
 #include "mb/pg_wchar.h"
 #include "parser/scansup.h"
 #include "utils/builtins.h"
@@ -1059,9 +1060,11 @@ static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(const char *name, const char *const *array, int *len);
+static int	seq_search_ascii(const char *name, const char *const *array, int *len);
+static int	seq_search_localized(const char *name, char **array, int *len);
 static int	from_char_seq_search(int *dest, const char **src,
  const char *const *array,
+ char **localized_array,
  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
 			struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2459,7 +2462,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
 	unsigned char firstc;
 	const char *const *a;
@@ -2505,8 +2508,92 @@ seq_search(const char *name, const char *const *array, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * downcasing transformation to achieve case-insensitivity.
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len)
+{
+	char	  **a;
+	char	   *lower_name;
+	char	   *norm_name;
+	int			name_len;
+	int			norm_len;
+
+	*len = 0;
+
+	/* empty string can't match anything */
+	if (!*name)
+		return -1;
+
+	name_len = strlen(name);
+
+	/* Normalize name, if working in Unicode */
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		pg_wchar   *wchar_name;

Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-23 Thread Tom Lane
Alvaro Herrera  writes:
> Even whitespace is problematic for some languages, such as Afan,

> mon  "Qunxa Garablu";/
>  "Naharsi Kudo";/
>  "Ciggilta Kudo";/

> (etc) but I think whitespace-splitting is going to be more comprehensible
> in the vast majority of cases, even if not perfect.

Interesting.  Given that to_date et al are often willing to ignore
whitespace in input, I wonder whether we won't have some other issues
with names like that --- not so much that basic cases wouldn't work,
as that people might reasonably expect that, say, we'd be flexible
about the amount of whitespace.  Seems like a problem for another day
though.

In the meantime, I agree that "truncate at whitespace" is a better
heuristic for the error messages than what we've got.  I'll go make it so.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-23 Thread Alvaro Herrera
On 2020-Jan-23, Tom Lane wrote:

> That particular case could be improved by stopping at a dash ... but
> since this code is also used to match strings like "A.M.", we can't
> just exclude punctuation in general.  Breaking at whitespace seems
> like a reasonable compromise.

Yeah, and there are cases where dashes are used in names -- browsing
through glibc for example I quickly found Akan, for which the month names are:

mon  "Sanda-ppn";/
 "Kwakwar-gyefuo";/
 "Ebw-benem";/

and so on.  Even whitespace is problematic for some languages, such as Afan,

mon  "Qunxa Garablu";/
 "Naharsi Kudo";/
 "Ciggilta Kudo";/
(etc) but I think whitespace-splitting is going to be more comprehensible
in the vast majority of cases, even if not perfect.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Jan-22, Tom Lane wrote:
>> Arthur Zakirov  writes:
>>> Shouldn't we just show all remaining string instead of truncating it? 

>> That would avoid a bunch of arbitrary decisions, for sure.
>> Anybody have an objection?

> I think it would be clearer to search for whitespace starting at the
> start of the bogus token and stop there.  It might not be perfect,
> particularly if any language has whitespace in a month etc name (I don't
> know any example but I guess it's not impossible for it to exist;
> Portuguese weekday names maybe?), but it seems better than either of the
> behaviors shown above.

I'm okay with that.  It won't work so well for cases like
"1-Januzry-1999", but it's still better than what happens now:

regression=# select to_date('1-Januzry-1999', 'DD MONTH ');
ERROR:  invalid value "Januzry-1" for "MONTH"

That particular case could be improved by stopping at a dash ... but
since this code is also used to match strings like "A.M.", we can't
just exclude punctuation in general.  Breaking at whitespace seems
like a reasonable compromise.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-23 Thread Alvaro Herrera
On 2020-Jan-22, Tom Lane wrote:

> Arthur Zakirov  writes:
> > On 2020/01/23 7:11, Tom Lane wrote:
> >> Closer examination shows that the "max" argument is pretty bogus as
> >> well.  It doesn't do anything except confuse the reader, because there
> >> are no cases where the value passed is less than the maximum array entry
> >> length, so it never acts to change seq_search's behavior.  So we should
> >> just drop that behavior from seq_search, too, and redefine "max" as
> >> having no purpose except to specify how much of the string to show in
> >> error messages.  There's still a question of what that should be for
> >> non-English cases, but at least we now have a clear idea of what we
> >> need the value to do.
> 
> > Shouldn't we just show all remaining string instead of truncating it? 
> 
> That would avoid a bunch of arbitrary decisions, for sure.
> Anybody have an objection?

I think it would be clearer to search for whitespace starting at the
start of the bogus token and stop there.  It might not be perfect,
particularly if any language has whitespace in a month etc name (I don't
know any example but I guess it's not impossible for it to exist;
Portuguese weekday names maybe?), but it seems better than either of the
behaviors shown above.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-22 Thread Tom Lane
Arthur Zakirov  writes:
> On 2020/01/23 7:11, Tom Lane wrote:
>> Closer examination shows that the "max" argument is pretty bogus as
>> well.  It doesn't do anything except confuse the reader, because there
>> are no cases where the value passed is less than the maximum array entry
>> length, so it never acts to change seq_search's behavior.  So we should
>> just drop that behavior from seq_search, too, and redefine "max" as
>> having no purpose except to specify how much of the string to show in
>> error messages.  There's still a question of what that should be for
>> non-English cases, but at least we now have a clear idea of what we
>> need the value to do.

> Shouldn't we just show all remaining string instead of truncating it? 

That would avoid a bunch of arbitrary decisions, for sure.
Anybody have an objection?

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-22 Thread Arthur Zakirov

On 2020/01/23 7:11, Tom Lane wrote:

Closer examination shows that the "max" argument is pretty bogus as
well.  It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior.  So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages.  There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.


Shouldn't we just show all remaining string instead of truncating it? 
For example I get the following output:


=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH ');
ERROR:  invalid value "ЯНВА" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for 
this field.


This behavior is reproduced without the patch though (on Postgres 12).

And the English case might confuse too I think:

=# select to_date('3 JANUZRY 1999', 'DD MONTH ');
ERROR:  invalid value "JANUZRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for 
this field.


Users might think what means "1" in "JANUZRY 1" string.

I attached the draft patch, which shows all remaining string. So the 
query above will show the following:


=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH ');
ERROR:  invalid value "ЯНВАРЯ 1999" for "MONTH"
DETAIL:  The remaining value did not match any of the allowed values for 
this field.



The 0001 patch attached cleans up all the above complaints.  I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.


+1 on the patch.

--
Arthur
diff --git a/src/backend/utils/adt/formatting.c 
b/src/backend/utils/adt/formatting.c
index ef21b7e426..63e773c4c1 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -2539,17 +2539,11 @@ from_char_seq_search(int *dest, const char **src, const 
char *const *array,
 
if (len <= 0)
{
-   charcopy[DCH_MAX_ITEM_SIZ + 1];
-
-   /* Use multibyte-aware truncation to avoid generating a bogus 
string */
-   max = pg_mbcliplen(*src, strlen(*src), max);
-   strlcpy(copy, *src, max + 1);
-
RETURN_ERROR(ereport(ERROR,
 
(errcode(ERRCODE_INVALID_DATETIME_FORMAT),
  errmsg("invalid value 
\"%s\" for \"%s\"",
-copy, 
node->key->name),
- errdetail("The given 
value did not match any of "
+*src, 
node->key->name),
+ errdetail("The 
remaining value did not match any of "

"the allowed values for this field.";
}
*src += len;


Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-22 Thread Tom Lane
I wrote:
> One thing I completely don't understand is why it's sufficient for
> seq_search_localized to do "initcap" semantics.  Shouldn't it cover
> the all-upper and all-lower cases as well, as the existing seq_search
> code does?  (That is, why is it okay to ignore the "type" argument?)

I took a closer look at this and decided you were probably doing that
just because the seq_search code uses initcap-style case folding to
match month and day names, relying on the assumption that the constant
arrays it's passed are cased that way.  But we shouldn't (and the patch
doesn't) assume that the localized names we'll get from pg_locale.c are
cased that way.

However ... it seems to me that the way seq_search is defined is
plenty bogus.  In the first place, it scribbles on its source string,
which is surely not okay.  You can see that in error messages that
are thrown on match failures:

regression=# select to_date('3 JANUZRY 1999', 'DD MONTH ');
ERROR:  invalid value "JanuzRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for this field.

Now, maybe somebody out there thinks this is a cute way of highlighting
how much of the string we examined, but it looks more like a bug from
where I sit.  It's mere luck that what's being clobbered is a local
string copy created by do_to_timestamp(), and not the original input
data passed to to_date().

In the second place, there's really no need at all for seq_search to
rely on any assumptions about the case state of the array it's
searching.  pg_toupper() is pretty cheap already, and we can make it
guaranteed cheap if we use pg_ascii_toupper() instead.  So I think
we ought to just remove the "type" argument altogether and have
seq_search dynamically convert all the strings it examines to upper
case (or lower case would work as well, at least for English).

> On the other hand, you probably *should* be ignoring the "max"
> argument, because AFAICS the values that are passed for that are
> specific to the English ASCII spellings of the day and month names.
> It seems very likely that they could be too small for some sets of
> non-English names.

Closer examination shows that the "max" argument is pretty bogus as
well.  It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior.  So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages.  There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

I also noted while I was looking at it that from_char_seq_search()
would truncate at "max" bytes even when dealing with multibyte input.
That has a clear risk of generating an invalidly-encoded error message.

The 0001 patch attached cleans up all the above complaints.  I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.

I'm tempted to go a bit further than 0001 does, and remove the 'max'
argument from from_char_seq_search() altogether.  Since we only need
'max' in error cases, which don't need to be super-quick, we could drop
the requirement for the caller to specify that and instead compute it
when we do need it as the largest of the constant array's string
lengths.  That'd carry over into the localized-names case in a
reasonably straightforward way (though we might need to count characters
not bytes for that to work nicely).

Because of the risk of incorrectly-encoded error messages, I'm rather
tempted to claim that these patches (or at least 0001) are a bug fix
and should be back-patched.  In any case I think we should apply this
to HEAD and then rebase the localized-names patch over it.  It makes
a whole lot more sense, IMO, for the localized-names comparison logic
to match what this is doing than what seq_search does today.

Comments?

regards, tom lane

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index ca3c48d..3ef10dc 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL};
  * Flags & Options:
  * --
  */
-#define ONE_UPPER		1		/* Name */
-#define ALL_UPPER		2		/* NAME */
-#define ALL_LOWER		3		/* name */
-
 #define MAX_MONTH_LEN	9
 #define MAX_MON_LEN		3
 #define MAX_DAY_LEN		9
@@ -1068,9 +1064,9 @@ static int	from_char_parse_int_len(int *dest, char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(char *name, const char *const *array, int type, int max,

Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-21 Thread Tom Lane
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?=  
writes:
> [ 0001-Allow-localized-month-names-to_date-v6.patch ]

I took a quick look through this.

One thing I completely don't understand is why it's sufficient for
seq_search_localized to do "initcap" semantics.  Shouldn't it cover
the all-upper and all-lower cases as well, as the existing seq_search
code does?  (That is, why is it okay to ignore the "type" argument?)

On the other hand, you probably *should* be ignoring the "max"
argument, because AFAICS the values that are passed for that are
specific to the English ASCII spellings of the day and month names.
It seems very likely that they could be too small for some sets of
non-English names.

Related to that, the business with

+   mb_max = max * pg_encoding_max_length(encoding);
+   name_len = strlen(name);
+   name_len = name_len < mb_max ? name_len : mb_max;

seems wrong even if we assume that "max" is a sane limit on the number of
characters to consider.  This coding is likely to truncate a long "name"
string in the middle of a multibyte character, leading to bad-encoding
errors from later functions.

I also am confused by the "if (mb_max > max ...)" bit.  It looks to me
like that's an obscure way of writing "if (pg_encoding_max_length() > 1)",
which is a rather pointless check considering that the if() then goes on
to insist on encoding == PG_UTF8.

A bit further down, you have an "if (name_wlen > norm_wlen)"
condition that seems pretty fishy.  Is it really guaranteed that
unicode_normalize_kc cannot transform the string without shortening it?
I don't see that specified in its header comment, for sure.

Also, it's purely accidental if this:

norm_name = (char *) palloc((norm_wlen + 1) * sizeof(pg_wchar));

allocates a long enough string for the conversion back to multibyte form,
because that output is not pg_wchars.  I think you want something more
like "norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1".  (I wonder whether we
need to be worrying about integer overflow in any of this.)

It seems like you could eliminate a lot of messiness by extending
localized_abbrev_days[] and related arrays by one element that's
always NULL.  That would waste, hmm, four 8-byte pointers on typical
machines --- but you're eating a lot more than 32 bytes of code to
pass around the array lengths, plus it's really ugly that the plain
and localized arrays are handled differently.

I don't think the way you're managing the "localized_names" variable
in DCH_from_char is very sensible.  The reader has to do a lot of
reverse engineering just to discover that the value is constant NULL
in most references, something that you've not helped by declaring
it outside the loop rather than inside.  I think you could drop
the variable and write constant NULL in most places, with something
like
S_TM(n->suffix) ? localized_full_days : NULL
in those from_char_seq_search calls where it's relevant.

In general I'm displeased with the lack of attention to comments.
Notably, you added arguments to from_char_seq_search without updating
its header comment ... not that that comment is a great API spec, but
at the least you shouldn't be making it worse.  I think the bug I
complained of above is directly traceable to the lack of a clear spec
here for what "max" means, so failure to keep these comments up to
speed does have demonstrable bad consequences.

regards, tom lane




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-15 Thread Juan José Santamaría Flecha
On Wed, Jan 15, 2020 at 11:20 AM Arthur Zakirov  wrote:

>
> I have some couple picky notes.
>
>
Thanks for the review.


> > + if (name_len != norm_len)
> > + pfree(norm_name);
>
> I'm not sure here. Is it save to assume that if it was allocated a new
> norm_name name_len and norm_len always will differ?
>

Good call, that check can be more robust.


>
> >  static const char *const months_full[] = {
> >   "January", "February", "March", "April", "May", "June", "July",
> > - "August", "September", "October", "November", "December", NULL
> > + "August", "September", "October", "November", "December"
> >  };
>
> Unfortunately I didn't see from the patch why it was necessary to remove
> last null element from months_full, days_short, adbc_strings,
> adbc_strings_long and other arrays. I think it is not good because
> unnecessarily increases the patch and adds code like the following:
>
> > + from_char_seq_search(&value, &s, months,
> localized_names,
> > +
> ONE_UPPER, MAX_MON_LEN, n, have_error,
> > +
> lengthof(months_full));
>
> Here it passes "months" from datetime.c to from_char_seq_search() and
> calculates its length using "months_full" array from formatting.c.
>


With the introduction of seq_search_localized() that can be avoided,
minimizing code churn.

Please, find attached a version addressing the above mentioned.

Regards,

Juan José Santamaría Flecha

>
>


0001-Allow-localized-month-names-to_date-v6.patch
Description: Binary data


Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-15 Thread Arthur Zakirov

Hello!

On 2020/01/13 21:04, Juan José Santamaría Flecha wrote:

Please, find attached a version addressing the above mentioned.


I have some couple picky notes.


+   if (name_len != norm_len)
+   pfree(norm_name);


I'm not sure here. Is it save to assume that if it was allocated a new 
norm_name name_len and norm_len always will differ?



 static const char *const months_full[] = {
"January", "February", "March", "April", "May", "June", "July",
-   "August", "September", "October", "November", "December", NULL
+   "August", "September", "October", "November", "December"
 };


Unfortunately I didn't see from the patch why it was necessary to remove 
last null element from months_full, days_short, adbc_strings, 
adbc_strings_long and other arrays. I think it is not good because 
unnecessarily increases the patch and adds code like the following:



+   from_char_seq_search(&value, &s, months, 
localized_names,
+
ONE_UPPER, MAX_MON_LEN, n, have_error,
+
lengthof(months_full));


Here it passes "months" from datetime.c to from_char_seq_search() and 
calculates its length using "months_full" array from formatting.c.


--
Arthur




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-13 Thread Juan José Santamaría Flecha
On Sat, Jan 11, 2020 at 5:06 PM Tomas Vondra 
wrote:

>
> Thanks. I did a quick review of this patch, and I think it's almost RFC.
>
>
Thanks for reviewing.


> - In func.sgml, it seems we've lost this bit:
>
>
> TM does not include trailing blanks.
> to_timestamp and to_date
> ignore
> the TM modifier.
>
>
>Does that mean the function no longer ignore the TM modifier? That
>would be somewhat problematic (i.e. it might break some user code).
>But looking at the code I don't see why the patch would have this
>effect, so I suppose it's merely a doc bug.
>
>
It is intentional. This patch uses the TM modifier to identify the usage of
localized names as input for to_timestamp() and to_date().


> - I don't think we need to include examples how to_timestmap ignores
>case, I'd say just stating the fact is clear enough. But if we want to
>have examples, I think we should not inline in the para but use the
>established pattern:
>
>
>  Some examples:
> 
> ...
> 
> 
>
>which is used elsewhere in the func.sgml file.
>

I was trying to match the style surrounding the usage notes for date/time
formatting [1]. Agreed that it is not worth an example on its own, so
dropped.


>
> - In formatting.c the "common/unicode_norm.h" should be right after
>includes from "catalog/" to follow the alphabetic order (unless
>there's a reason why that does not work).
>

Fixed.


>
> - I rather dislike the "dim" parameter name, because I immediately think
>"dimension" which makes little sense. I suggest renaming to "nitems"
>or "nelements" or something like that.
>


Agreed, using "nelements" as a better style matchup.

Please, find attached a version addressing the above mentioned.

[1] https://www.postgresql.org/docs/current/functions-formatting.html

Regards,

Juan José Santamaría Flecha

>


0001-Allow-localized-month-names-to_date-v5.patch
Description: Binary data


Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-11 Thread Tomas Vondra

On Thu, Sep 26, 2019 at 08:36:25PM +0200, Juan José Santamaría Flecha wrote:

On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera  wrote:


This patch no longer applies.  Can you please rebase?


Thank you for the notification.

The patch rot after commit 1a950f3, a rebased version is attached.



Thanks. I did a quick review of this patch, and I think it's almost RFC.
I only found a couple of minor issue:

- In func.sgml, it seems we've lost this bit:

  
   TM does not include trailing blanks.
   to_timestamp and to_date ignore
   the TM modifier.
  

  Does that mean the function no longer ignore the TM modifier? That
  would be somewhat problematic (i.e. it might break some user code).
  But looking at the code I don't see why the patch would have this
  effect, so I suppose it's merely a doc bug.

- I don't think we need to include examples how to_timestmap ignores
  case, I'd say just stating the fact is clear enough. But if we want to
  have examples, I think we should not inline in the para but use the
  established pattern:

  
Some examples:

...

   

  which is used elsewhere in the func.sgml file.

- In formatting.c the "common/unicode_norm.h" should be right after
  includes from "catalog/" to follow the alphabetic order (unless
  there's a reason why that does not work).

- I rather dislike the "dim" parameter name, because I immediately think
  "dimension" which makes little sense. I suggest renaming to "nitems"
  or "nelements" or something like that. 



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Allow to_date() and to_timestamp() to accept localized names

2019-09-26 Thread Juan José Santamaría Flecha
On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera  wrote:
>
> This patch no longer applies.  Can you please rebase?

Thank you for the notification.

The patch rot after commit 1a950f3, a rebased version is attached.

Regards,

Juan José Santamaría Flecha
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 67f1a82..bf9055e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6578,8 +6578,17 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
TM does not include trailing blanks.
+  
+ 
+
+ 
+  
to_timestamp and to_date ignore
-   the TM modifier.
+   the case when receiving names as an input.  For example, either
+   to_timestamp('2000-JUN', '-MON') or
+   to_timestamp('2000-Jun', '-MON') or
+   to_timestamp('2000-JUN', '-mon') work and return
+   the same output
   
  
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f7175df..d69d89a 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -97,6 +97,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "common/unicode_norm.h"
 
 /* --
  * Convenience macros for error handling
@@ -220,11 +221,11 @@ typedef struct
  */
 static const char *const months_full[] = {
 	"January", "February", "March", "April", "May", "June", "July",
-	"August", "September", "October", "November", "December", NULL
+	"August", "September", "October", "November", "December"
 };
 
 static const char *const days_short[] = {
-	"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", NULL
+	"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
 };
 
 /* --
@@ -256,8 +257,8 @@ static const char *const days_short[] = {
  * matches for BC have an odd index.  So the boolean value for BC is given by
  * taking the array index of the match, modulo 2.
  */
-static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR, NULL};
-static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR, NULL};
+static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR};
+static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR};
 
 /* --
  * AM / PM
@@ -283,8 +284,8 @@ static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_S
  * matches for PM have an odd index.  So the boolean value for PM is given by
  * taking the array index of the match, modulo 2.
  */
-static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR, NULL};
-static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR, NULL};
+static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR};
+static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR};
 
 /* --
  * Months in roman-numeral
@@ -293,10 +294,10 @@ static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_S
  * --
  */
 static const char *const rm_months_upper[] =
-{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I", NULL};
+{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"};
 
 static const char *const rm_months_lower[] =
-{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i", NULL};
+{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i"};
 
 /* --
  * Roman numbers
@@ -1068,10 +1069,11 @@ static int	from_char_parse_int_len(int *dest, char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(char *name, const char *const *array, int type, int max, int *len);
+static int	seq_search_sqlascii(char *name, const char *const *array, int type, int max, int *len, int dim);
+static int	seq_search_localized(char *name, char **array, int max, int *len, int dim);
 static int	from_char_seq_search(int *dest, char **src,
- const char *const *array, int type, int max,
- FormatNode *node, bool *have_error);
+ const char *const *array, char **localized_array, int type, int max,
+ FormatNode *node, bool *have_error, int dim);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
 			struct pg_tm *tm, fsec_t *fsec, int *fprec,
 			uint32 *flags, bool *have_error);
@@ -2454,17 +2456,18 @@ from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error)
 }
 
 /* --
- * Sequential search with to upper/lower conversion
+ * Sequential search with to upper/lower conversion for SQL_ASCII array input
  * --
  */
 static int
-seq_search(char *name, const char *const *array, int type, int max, int *len)
+seq_search_sqlascii(char *name, const char *const *array, int type, int max, int *len, int dim)
 {
 	const char *p;
 	const char 

Re: Allow to_date() and to_timestamp() to accept localized names

2019-09-25 Thread Alvaro Herrera
This patch no longer applies.  Can you please rebase?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2019-09-22 Thread Juan José Santamaría Flecha
On Wed, Sep 18, 2019 at 11:09 AM Juan José Santamaría Flecha
 wrote:
>
> On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera
>  wrote:
> >
> Thanks for taking a look at this.
>
> > I'm confused why we acquire the MONTH_DIM / etc definitions.  Can't we
> > just use lengthof() of the corresponding array?  AFAICS it should work
> > just as well.
> >
>
> It was because of the length difference between ascii-name arrays,
> which were all null-ended, and localized-name arrays. The attached
> version uses lengthof().
>
> > I wonder if the "compare first char" thing (seq_search_localized) really
> > works when there are multibyte chars in the day/month names.  I think
> > the code compares just the first char ... but what if the original
> > string uses those funny Unicode non-normalized letters and the locale
> > translation uses normalized letters?  My guess is that the first-char
> > comparison will fail, but surely you'll want the name to match.
> > (There's no month/day name in Spanish that doesn't start with an ASCII
> > letter, but I bet there are some in other languages.)  I think the
> > localized comparison should avoid the first-char optimization, just
> > compare the whole string all the time, and avoid possible weird issues.
>
> The localized search is reformulated in this version to deal with
> multibyte normalization. A regression test for this issue is included.

This version is updated to optimize the need for dynamic allocation.


> Regards,
>
> Juan José Santamaría Flecha
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4e3e213..c470c0e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6415,8 +6415,17 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
TM does not include trailing blanks.
+  
+ 
+
+ 
+  
to_timestamp and to_date ignore
-   the TM modifier.
+   the case when receiving names as an input.  For example, either
+   to_timestamp('2000-JUN', '-MON') or
+   to_timestamp('2000-Jun', '-MON') or
+   to_timestamp('2000-JUN', '-mon') work and return
+   the same output.
   
  
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 755ca6e..39d2d11 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -96,6 +96,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "common/unicode_norm.h"
 
 /* --
  * Routines type
@@ -174,18 +175,17 @@ typedef struct
 #define CLOCK_24_HOUR		0
 #define CLOCK_12_HOUR		1
 
-
 /* --
  * Full months
  * --
  */
 static const char *const months_full[] = {
 	"January", "February", "March", "April", "May", "June", "July",
-	"August", "September", "October", "November", "December", NULL
+	"August", "September", "October", "November", "December"
 };
 
 static const char *const days_short[] = {
-	"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", NULL
+	"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
 };
 
 /* --
@@ -217,8 +217,8 @@ static const char *const days_short[] = {
  * matches for BC have an odd index.  So the boolean value for BC is given by
  * taking the array index of the match, modulo 2.
  */
-static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR, NULL};
-static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR, NULL};
+static const char *const adbc_strings[] = {ad_STR, bc_STR, AD_STR, BC_STR};
+static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_STR};
 
 /* --
  * AM / PM
@@ -244,8 +244,8 @@ static const char *const adbc_strings_long[] = {a_d_STR, b_c_STR, A_D_STR, B_C_S
  * matches for PM have an odd index.  So the boolean value for PM is given by
  * taking the array index of the match, modulo 2.
  */
-static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR, NULL};
-static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR, NULL};
+static const char *const ampm_strings[] = {am_STR, pm_STR, AM_STR, PM_STR};
+static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_STR};
 
 /* --
  * Months in roman-numeral
@@ -254,10 +254,10 @@ static const char *const ampm_strings_long[] = {a_m_STR, p_m_STR, A_M_STR, P_M_S
  * --
  */
 static const char *const rm_months_upper[] =
-{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I", NULL};
+{"XII", "XI", "X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"};
 
 static const char *const rm_months_lower[] =
-{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i", NULL};
+{"xii", "xi", "x", "ix", "viii", "vii", "vi", "v", "iv", "iii", "ii", "i"};
 
 /* --
  * Roman numbers
@@ -975,7 +975,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 
 static void DCH_to_char(FormatNode *node, bool is_

Re: Allow to_date() and to_timestamp() to accept localized names

2019-09-18 Thread Juan José Santamaría Flecha
On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera
 wrote:
>
Thanks for taking a look at this.

> I'm confused why we acquire the MONTH_DIM / etc definitions.  Can't we
> just use lengthof() of the corresponding array?  AFAICS it should work
> just as well.
>

It was because of the length difference between ascii-name arrays,
which were all null-ended, and localized-name arrays. The attached
version uses lengthof().

> I wonder if the "compare first char" thing (seq_search_localized) really
> works when there are multibyte chars in the day/month names.  I think
> the code compares just the first char ... but what if the original
> string uses those funny Unicode non-normalized letters and the locale
> translation uses normalized letters?  My guess is that the first-char
> comparison will fail, but surely you'll want the name to match.
> (There's no month/day name in Spanish that doesn't start with an ASCII
> letter, but I bet there are some in other languages.)  I think the
> localized comparison should avoid the first-char optimization, just
> compare the whole string all the time, and avoid possible weird issues.

The localized search is reformulated in this version to deal with
multibyte normalization. A regression test for this issue is included.


Regards,

Juan José Santamaría Flecha


0001-Allow-localized-month-names-to_date-v2.patch
Description: Binary data


Re: Allow to_date() and to_timestamp() to accept localized names

2019-09-13 Thread Alvaro Herrera
On 2019-Aug-22, Juan José Santamaría Flecha wrote:

> On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
>  wrote:
> >
> > Going through the current items in the wiki's todo list, I have been
> > looking into: "Allow to_date () and to_timestamp () to accept
> > localized month names".
> 
> I have gone through a second take on this, trying to give it a better
> shape but it would surely benefit from some review, so I will open an
> item in the commitfest.

I'm confused why we acquire the MONTH_DIM / etc definitions.  Can't we
just use lengthof() of the corresponding array?  AFAICS it should work
just as well.

I wonder if the "compare first char" thing (seq_search_localized) really
works when there are multibyte chars in the day/month names.  I think
the code compares just the first char ... but what if the original
string uses those funny Unicode non-normalized letters and the locale
translation uses normalized letters?  My guess is that the first-char
comparison will fail, but surely you'll want the name to match.
(There's no month/day name in Spanish that doesn't start with an ASCII
letter, but I bet there are some in other languages.)  I think the
localized comparison should avoid the first-char optimization, just
compare the whole string all the time, and avoid possible weird issues.

But then, I'm not 100% sure of this code.  formatting.c is madness
distilled.

Thanks,

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2019-08-23 Thread Juan José Santamaría Flecha
On Thu, Aug 22, 2019 at 9:38 PM Juan José Santamaría Flecha
 wrote:
>
> >
> > Going through the current items in the wiki's todo list, I have been
> > looking into: "Allow to_date () and to_timestamp () to accept
> > localized month names".
> >
>
> I have gone through a second take on this, trying to give it a better
> shape but it would surely benefit from some review, so I will open an
> item in the commitfest.

For reviewers, the current aproach for this patch is:

Break seq_search() into two functions:
  * seq_search_sqlascii() that supports seq_search() current usage.
  * seq_search_localized() similar to the previous but supports multibyte input.
To avoid code duplication most of current seq_search() logic has been
moved to a new function str_compare().
from_char_seq_search() is now responsible to choose between
seq_search_sqlascii() or seq_search_localized().
Also, since localized names is not a null terminated array,
seq_search() now receives the dimension as input and terminating nulls
have been removed from static arrays.

The commitfest item is:

https://commitfest.postgresql.org/24/2255/

Regards,

Juan José Santamaría Flecha




Re: Allow to_date() and to_timestamp() to accept localized names

2019-08-22 Thread Juan José Santamaría Flecha
On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
 wrote:
>
> Going through the current items in the wiki's todo list, I have been
> looking into: "Allow to_date () and to_timestamp () to accept
> localized month names".
>

I have gone through a second take on this, trying to give it a better
shape but it would surely benefit from some review, so I will open an
item in the commitfest.

Regards,

Juan José Santamaría Flecha


0001-Allow-localized-month-names-to_date-v1.patch
Description: Binary data