Re: Fix number skipping in to_number

Tom Lane Wed, 15 Nov 2017 13:29:40 -0800

I wrote:
> Oliver Ford <[email protected]> writes:
>> On Monday, 13 November 2017, Tom Lane <[email protected]> wrote:
>>> I don't follow your concern?  If "$" is not the correct currency
>>> symbol for the locale, we shouldn't accept it as a match to an L format.
>>> Your patch is tightening what we will accept as a match to a G format,
>>> so I don't see why you're concerned about backward compatibility in
>>> one case but not the other.


>> It's a guess as to the likely use case. I would imagine that people are
>> likely to use a currency symbol different from the locale, but unlikely to
>> use a different group separator. Others might have a different opinion
>> though.

> Well, if they use a currency symbol different from the locale's, they're
> in trouble anyway because the number of bytes might be different.  In most
> encodings, symbols other than "$" are probably not 1-byte characters.
> At the very least I think we need to constrain it enough that it not
> swallow a fractional character.

After more testing I understood your concern about L_currency_symbol:
in C locale that's " ", not "$" as I naively imagined.  Americans,
at least, would be pretty unhappy if "$1234.56" suddenly stopped matching
"L9999.99".  So it seems like we can't institute a strict matching rule.
However, it is certainly not good that this happens:

regression=# select to_number('1234.56', 'L9999.99');
 to_number 
-----------
    234.56
(1 row)

To me that seems just as bad as having ',' or 'G' eat a digit.

After some reflection I propose that the rule that we want is:

* ',' and 'G' consume input only if it exactly matches the expected
separator.

* Other non-data template patterns consume a number of input characters
equal to the number of characters they'd produce in output, *except* that
these patterns will not consume data characters (digits, signs, decimal
point, comma).

I think that while we are at it we should take some measures to ensure
that "character" in this definition means "character", not "byte".
It is not good that a euro currency symbol might consume an
encoding-dependent number of input characters.

That leads me to the attached patch.  There is more that could be done
here --- in particular, I'd like to see the character-not-byte-count
rule extended to literal text.  But that seems like fit material for
a different patch.

Also, I noticed that in your form of the patch, the strncmp() could read
past the end of the string, possibly resulting in a crash.  So I made it
use the AMOUNT_TEST infrastructure from NUM_numpart_from_char to avoid that.

One other note: I realized that it was only pure luck that your regression
test cases worked in locales where 'G' is decimal point --- they still
gave the same answer, but through a totally different interpretation of
the input.  That did not seem like a good idea, so I adjusted the
regression test to force C locale for the to_number() tests.  I wish we
could use some other locale here, but then it likely wouldn't be portable
to Windows.

                        regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f901567..35a845c 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** SELECT regexp_match('abc01234xyz', '(?:(
*** 5850,5856 ****
      data based on the given value.  Any text that is not a template pattern is
      simply copied verbatim.  Similarly, in an input template string (for the
      other functions), template patterns identify the values to be supplied by
!     the input data string.
     </para>
  
    <para>
--- 5850,5859 ----
      data based on the given value.  Any text that is not a template pattern is
      simply copied verbatim.  Similarly, in an input template string (for the
      other functions), template patterns identify the values to be supplied by
!     the input data string.  If there are characters in the template string
!     that are not template patterns, the corresponding characters in the input
!     data string are simply skipped over (whether or not they are equal to the
!     template string characters).
     </para>
  
    <para>
*************** SELECT regexp_match('abc01234xyz', '(?:(
*** 6176,6188 ****
         Ordinary text is allowed in <function>to_char</function>
         templates and will be output literally.  You can put a substring
         in double quotes to force it to be interpreted as literal text
!        even if it contains pattern key words.  For example, in
         <literal>'"Hello Year "YYYY'</literal>, the <literal>YYYY</literal>
         will be replaced by the year data, but the single <literal>Y</literal> in <literal>Year</literal>
!        will not be.  In <function>to_date</function>, <function>to_number</function>,
!        and <function>to_timestamp</function>, double-quoted strings skip the number of
!        input characters contained in the string, e.g. <literal>"XX"</literal>
!        skips two input characters.
        </para>
       </listitem>
  
--- 6179,6193 ----
         Ordinary text is allowed in <function>to_char</function>
         templates and will be output literally.  You can put a substring
         in double quotes to force it to be interpreted as literal text
!        even if it contains template patterns.  For example, in
         <literal>'"Hello Year "YYYY'</literal>, the <literal>YYYY</literal>
         will be replaced by the year data, but the single <literal>Y</literal> in <literal>Year</literal>
!        will not be.
!        In <function>to_date</function>, <function>to_number</function>,
!        and <function>to_timestamp</function>, literal text and double-quoted
!        strings result in skipping the number of characters contained in the
!        string; for example <literal>"XX"</literal> skips two input characters
!        (whether or not they are <literal>XX</literal>).
        </para>
       </listitem>
  
*************** SELECT regexp_match('abc01234xyz', '(?:(
*** 6485,6490 ****
--- 6490,6506 ----
  
       <listitem>
        <para>
+        In <function>to_number</function>, if non-data template patterns such
+        as <literal>L</literal> or <literal>TH</literal> are used, the
+        corresponding number of input characters are skipped, whether or not
+        they match the template pattern, unless they are data characters
+        (that is, digits, sign, decimal point, or comma).  For
+        example, <literal>TH</literal> would skip two non-data characters.
+       </para>
+      </listitem>
+ 
+      <listitem>
+       <para>
         <literal>V</literal> with <function>to_char</function>
         multiplies the input values by
         <literal>10^<replaceable>n</replaceable></literal>, where
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 50254f2..5afc293 100644
*** a/src/backend/utils/adt/formatting.c
--- b/src/backend/utils/adt/formatting.c
*************** static char *get_last_relevant_decnum(ch
*** 988,994 ****
  static void NUM_numpart_from_char(NUMProc *Np, int id, int input_len);
  static void NUM_numpart_to_char(NUMProc *Np, int id);
  static char *NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
! 			  char *number, int from_char_input_len, int to_char_out_pre_spaces,
  			  int sign, bool is_to_char, Oid collid);
  static DCHCacheEntry *DCH_cache_getnew(const char *str);
  static DCHCacheEntry *DCH_cache_search(const char *str);
--- 988,994 ----
  static void NUM_numpart_from_char(NUMProc *Np, int id, int input_len);
  static void NUM_numpart_to_char(NUMProc *Np, int id);
  static char *NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
! 			  char *number, int input_len, int to_char_out_pre_spaces,
  			  int sign, bool is_to_char, Oid collid);
  static DCHCacheEntry *DCH_cache_getnew(const char *str);
  static DCHCacheEntry *DCH_cache_search(const char *str);
*************** get_last_relevant_decnum(char *num)
*** 4232,4237 ****
--- 4232,4245 ----
  	return result;
  }
  
+ /*
+  * These macros are used in NUM_processor() and its subsidiary routines.
+  * OVERLOAD_TEST: true if we've reached end of input string
+  * AMOUNT_TEST(s): true if at least s characters remain in string
+  */
+ #define OVERLOAD_TEST	(Np->inout_p >= Np->inout + input_len)
+ #define AMOUNT_TEST(s)	(Np->inout_p <= Np->inout + (input_len - (s)))
+ 
  /* ----------
   * Number extraction for TO_NUMBER()
   * ----------
*************** NUM_numpart_from_char(NUMProc *Np, int i
*** 4246,4254 ****
  		 (id == NUM_0 || id == NUM_9) ? "NUM_0/9" : id == NUM_DEC ? "NUM_DEC" : "???");
  #endif
  
- #define OVERLOAD_TEST	(Np->inout_p >= Np->inout + input_len)
- #define AMOUNT_TEST(_s) (input_len-(Np->inout_p-Np->inout) >= _s)
- 
  	if (OVERLOAD_TEST)
  		return;
  
--- 4254,4259 ----
*************** NUM_numpart_to_char(NUMProc *Np, int id)
*** 4641,4654 ****
  	++Np->num_curr;
  }
  
  static char *
  NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
! 			  char *number, int from_char_input_len, int to_char_out_pre_spaces,
  			  int sign, bool is_to_char, Oid collid)
  {
  	FormatNode *n;
  	NUMProc		_Np,
  			   *Np = &_Np;
  
  	MemSet(Np, 0, sizeof(NUMProc));
  
--- 4646,4677 ----
  	++Np->num_curr;
  }
  
+ /*
+  * Skip over "n" input characters, but only if they aren't numeric data
+  */
+ static void
+ NUM_eat_non_data_chars(NUMProc *Np, int n, int input_len)
+ {
+ 	while (n-- > 0)
+ 	{
+ 		if (OVERLOAD_TEST)
+ 			break;				/* end of input */
+ 		if (strchr("0123456789.,+-", *Np->inout_p) != NULL)
+ 			break;				/* it's a data character */
+ 		Np->inout_p += pg_mblen(Np->inout_p);
+ 	}
+ }
+ 
  static char *
  NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
! 			  char *number, int input_len, int to_char_out_pre_spaces,
  			  int sign, bool is_to_char, Oid collid)
  {
  	FormatNode *n;
  	NUMProc		_Np,
  			   *Np = &_Np;
+ 	const char *pattern;
+ 	int			pattern_len;
  
  	MemSet(Np, 0, sizeof(NUMProc));
  
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4816,4824 ****
  		if (!Np->is_to_char)
  		{
  			/*
! 			 * Check non-string inout end
  			 */
! 			if (Np->inout_p >= Np->inout + from_char_input_len)
  				break;
  		}
  
--- 4839,4849 ----
  		if (!Np->is_to_char)
  		{
  			/*
! 			 * Check at least one character remains to be scanned.  (In
! 			 * actions below, must use AMOUNT_TEST if we want to read more
! 			 * characters than that.)
  			 */
! 			if (OVERLOAD_TEST)
  				break;
  		}
  
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4828,4839 ****
  		if (n->type == NODE_TYPE_ACTION)
  		{
  			/*
! 			 * Create/reading digit/zero/blank/sing
  			 *
  			 * 'NUM_S' note: The locale sign is anchored to number and we
  			 * read/write it when we work with first or last number
! 			 * (NUM_0/NUM_9). This is reason why NUM_S missing in follow
! 			 * switch().
  			 */
  			switch (n->key->id)
  			{
--- 4853,4868 ----
  		if (n->type == NODE_TYPE_ACTION)
  		{
  			/*
! 			 * Create/read digit/zero/blank/sign/special-case
  			 *
  			 * 'NUM_S' note: The locale sign is anchored to number and we
  			 * read/write it when we work with first or last number
! 			 * (NUM_0/NUM_9).  This is why NUM_S is missing in switch().
! 			 *
! 			 * Notice the "Np->inout_p++" at the bottom of the loop.  This is
! 			 * why most of the actions advance inout_p one less than you might
! 			 * expect.  In cases where we don't want that increment to happen,
! 			 * a switch case ends with "continue" not "break".
  			 */
  			switch (n->key->id)
  			{
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4848,4854 ****
  					}
  					else
  					{
! 						NUM_numpart_from_char(Np, n->key->id, from_char_input_len);
  						break;	/* switch() case: */
  					}
  
--- 4877,4883 ----
  					}
  					else
  					{
! 						NUM_numpart_from_char(Np, n->key->id, input_len);
  						break;	/* switch() case: */
  					}
  
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4872,4881 ****
--- 4901,4914 ----
  							if (IS_FILLMODE(Np->Num))
  								continue;
  						}
+ 						if (*Np->inout_p != ',')
+ 							continue;
  					}
  					break;
  
  				case NUM_G:
+ 					pattern = Np->L_thousands_sep;
+ 					pattern_len = strlen(pattern);
  					if (Np->is_to_char)
  					{
  						if (!Np->num_in)
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4884,4899 ****
  								continue;
  							else
  							{
! 								int			x = strlen(Np->L_thousands_sep);
! 
! 								memset(Np->inout_p, ' ', x);
! 								Np->inout_p += x - 1;
  							}
  						}
  						else
  						{
! 							strcpy(Np->inout_p, Np->L_thousands_sep);
! 							Np->inout_p += strlen(Np->inout_p) - 1;
  						}
  					}
  					else
--- 4917,4932 ----
  								continue;
  							else
  							{
! 								/* just in case there are MB chars */
! 								pattern_len = pg_mbstrlen(pattern);
! 								memset(Np->inout_p, ' ', pattern_len);
! 								Np->inout_p += pattern_len - 1;
  							}
  						}
  						else
  						{
! 							strcpy(Np->inout_p, pattern);
! 							Np->inout_p += pattern_len - 1;
  						}
  					}
  					else
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4903,4920 ****
  							if (IS_FILLMODE(Np->Num))
  								continue;
  						}
! 						Np->inout_p += strlen(Np->L_thousands_sep) - 1;
  					}
  					break;
  
  				case NUM_L:
  					if (Np->is_to_char)
  					{
! 						strcpy(Np->inout_p, Np->L_currency_symbol);
! 						Np->inout_p += strlen(Np->inout_p) - 1;
  					}
  					else
! 						Np->inout_p += strlen(Np->L_currency_symbol) - 1;
  					break;
  
  				case NUM_RN:
--- 4936,4968 ----
  							if (IS_FILLMODE(Np->Num))
  								continue;
  						}
! 
! 						/*
! 						 * Because L_thousands_sep typically contains data
! 						 * characters (either '.' or ','), we can't use
! 						 * NUM_eat_non_data_chars here.  Instead skip only if
! 						 * the input matches L_thousands_sep.
! 						 */
! 						if (AMOUNT_TEST(pattern_len) &&
! 							strncmp(Np->inout_p, pattern, pattern_len) == 0)
! 							Np->inout_p += pattern_len - 1;
! 						else
! 							continue;
  					}
  					break;
  
  				case NUM_L:
+ 					pattern = Np->L_currency_symbol;
  					if (Np->is_to_char)
  					{
! 						strcpy(Np->inout_p, pattern);
! 						Np->inout_p += strlen(pattern) - 1;
  					}
  					else
! 					{
! 						NUM_eat_non_data_chars(Np, pg_mbstrlen(pattern), input_len);
! 						continue;
! 					}
  					break;
  
  				case NUM_RN:
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4949,4956 ****
  						continue;
  
  					if (Np->is_to_char)
  						strcpy(Np->inout_p, get_th(Np->number, TH_LOWER));
! 					Np->inout_p += 1;
  					break;
  
  				case NUM_TH:
--- 4997,5012 ----
  						continue;
  
  					if (Np->is_to_char)
+ 					{
  						strcpy(Np->inout_p, get_th(Np->number, TH_LOWER));
! 						Np->inout_p += 1;
! 					}
! 					else
! 					{
! 						/* All variants of 'th' occupy 2 characters */
! 						NUM_eat_non_data_chars(Np, 2, input_len);
! 						continue;
! 					}
  					break;
  
  				case NUM_TH:
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4959,4966 ****
  						continue;
  
  					if (Np->is_to_char)
  						strcpy(Np->inout_p, get_th(Np->number, TH_UPPER));
! 					Np->inout_p += 1;
  					break;
  
  				case NUM_MI:
--- 5015,5030 ----
  						continue;
  
  					if (Np->is_to_char)
+ 					{
  						strcpy(Np->inout_p, get_th(Np->number, TH_UPPER));
! 						Np->inout_p += 1;
! 					}
! 					else
! 					{
! 						/* All variants of 'TH' occupy 2 characters */
! 						NUM_eat_non_data_chars(Np, 2, input_len);
! 						continue;
! 					}
  					break;
  
  				case NUM_MI:
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4977,4982 ****
--- 5041,5051 ----
  					{
  						if (*Np->inout_p == '-')
  							*Np->number = '-';
+ 						else
+ 						{
+ 							NUM_eat_non_data_chars(Np, 1, input_len);
+ 							continue;
+ 						}
  					}
  					break;
  
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 4994,5016 ****
  					{
  						if (*Np->inout_p == '+')
  							*Np->number = '+';
  					}
  					break;
  
  				case NUM_SG:
  					if (Np->is_to_char)
  						*Np->inout_p = Np->sign;
- 
  					else
  					{
  						if (*Np->inout_p == '-')
  							*Np->number = '-';
  						else if (*Np->inout_p == '+')
  							*Np->number = '+';
  					}
  					break;
  
- 
  				default:
  					continue;
  					break;
--- 5063,5093 ----
  					{
  						if (*Np->inout_p == '+')
  							*Np->number = '+';
+ 						else
+ 						{
+ 							NUM_eat_non_data_chars(Np, 1, input_len);
+ 							continue;
+ 						}
  					}
  					break;
  
  				case NUM_SG:
  					if (Np->is_to_char)
  						*Np->inout_p = Np->sign;
  					else
  					{
  						if (*Np->inout_p == '-')
  							*Np->number = '-';
  						else if (*Np->inout_p == '+')
  							*Np->number = '+';
+ 						else
+ 						{
+ 							NUM_eat_non_data_chars(Np, 1, input_len);
+ 							continue;
+ 						}
  					}
  					break;
  
  				default:
  					continue;
  					break;
*************** NUM_processor(FormatNode *node, NUMDesc 
*** 5019,5025 ****
  		else
  		{
  			/*
! 			 * Remove to output char from input in TO_CHAR
  			 */
  			if (Np->is_to_char)
  				*Np->inout_p = n->character;
--- 5096,5107 ----
  		else
  		{
  			/*
! 			 * In TO_CHAR, non-pattern characters in the format are copied to
! 			 * the output.  In TO_NUMBER, we skip one input character for each
! 			 * non-pattern format character, whether or not it matches the
! 			 * format character.  (Currently, that's actually implemented as
! 			 * skipping one input byte per non-pattern format byte, which is
! 			 * wrong...)
  			 */
  			if (Np->is_to_char)
  				*Np->inout_p = n->character;
diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out
index 7e55b0e..a96bfc0 100644
*** a/src/test/regress/expected/numeric.out
--- b/src/test/regress/expected/numeric.out
*************** SELECT '' AS to_char_26, to_char('100'::
*** 1219,1224 ****
--- 1219,1225 ----
  
  -- TO_NUMBER()
  --
+ SET lc_numeric = 'C';
  SELECT '' AS to_number_1,  to_number('-34,338,492', '99G999G999');
   to_number_1 | to_number 
  -------------+-----------
*************** SELECT '' AS to_number_13, to_number(' .
*** 1297,1302 ****
--- 1298,1358 ----
                |     -0.01
  (1 row)
  
+ SELECT '' AS to_number_14, to_number('34,50','999,99');
+  to_number_14 | to_number 
+ --------------+-----------
+               |      3450
+ (1 row)
+ 
+ SELECT '' AS to_number_15, to_number('123,000','999G');
+  to_number_15 | to_number 
+ --------------+-----------
+               |       123
+ (1 row)
+ 
+ SELECT '' AS to_number_16, to_number('123456','999G999');
+  to_number_16 | to_number 
+ --------------+-----------
+               |    123456
+ (1 row)
+ 
+ SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99');
+  to_number_17 | to_number 
+ --------------+-----------
+               |   1234.56
+ (1 row)
+ 
+ SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99');
+  to_number_18 | to_number 
+ --------------+-----------
+               |   1234.56
+ (1 row)
+ 
+ SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99');
+  to_number_19 | to_number 
+ --------------+-----------
+               |   1234.56
+ (1 row)
+ 
+ SELECT '' AS to_number_20, to_number('1234.56','L99,999.99');
+  to_number_20 | to_number 
+ --------------+-----------
+               |   1234.56
+ (1 row)
+ 
+ SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99');
+  to_number_21 | to_number 
+ --------------+-----------
+               |   1234.56
+ (1 row)
+ 
+ SELECT '' AS to_number_22, to_number('42nd', '99th');
+  to_number_22 | to_number 
+ --------------+-----------
+               |        42
+ (1 row)
+ 
+ RESET lc_numeric;
  --
  -- Input syntax
  --
diff --git a/src/test/regress/sql/numeric.sql b/src/test/regress/sql/numeric.sql
index 9675b6e..321c7bd 100644
*** a/src/test/regress/sql/numeric.sql
--- b/src/test/regress/sql/numeric.sql
*************** SELECT '' AS to_char_26, to_char('100'::
*** 788,793 ****
--- 788,794 ----
  
  -- TO_NUMBER()
  --
+ SET lc_numeric = 'C';
  SELECT '' AS to_number_1,  to_number('-34,338,492', '99G999G999');
  SELECT '' AS to_number_2,  to_number('-34,338,492.654,878', '99G999G999D999G999');
  SELECT '' AS to_number_3,  to_number('<564646.654564>', '999999.999999PR');
*************** SELECT '' AS to_number_10, to_number('0'
*** 801,806 ****
--- 802,817 ----
  SELECT '' AS to_number_11, to_number('.-01', 'S99.99');
  SELECT '' AS to_number_12, to_number('.01-', '99.99S');
  SELECT '' AS to_number_13, to_number(' . 0 1-', ' 9 9 . 9 9 S');
+ SELECT '' AS to_number_14, to_number('34,50','999,99');
+ SELECT '' AS to_number_15, to_number('123,000','999G');
+ SELECT '' AS to_number_16, to_number('123456','999G999');
+ SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99');
+ SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99');
+ SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99');
+ SELECT '' AS to_number_20, to_number('1234.56','L99,999.99');
+ SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99');
+ SELECT '' AS to_number_22, to_number('42nd', '99th');
+ RESET lc_numeric;
  
  --
  -- Input syntax

Re: Fix number skipping in to_number

Reply via email to