Re: Consider \v to the list of whitespace characters in the parser

2023-07-05 Thread Michael Paquier
On Tue, Jul 04, 2023 at 09:28:21AM +0900, Michael Paquier wrote:
> Yeah, thanks.

I have looked again at that this morning, and did not notice any
missing spots, so applied..  Let's see how it goes.
--
Michael


signature.asc
Description: PGP signature


Re: Consider \v to the list of whitespace characters in the parser

2023-07-03 Thread Michael Paquier
On Mon, Jul 03, 2023 at 08:15:03PM -0400, Tom Lane wrote:
> Assuming we don't want to change either of these distinctions,
> the v2 patch looks about right to me.

Yeah, thanks.  Peter, what's your take?
--
Michael


signature.asc
Description: PGP signature


Re: Consider \v to the list of whitespace characters in the parser

2023-07-03 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Jul 03, 2023 at 12:17:10PM +0200, Peter Eisentraut wrote:
>> In scan.l, you might want to ponder horiz_space: Even though \v is clearly
>> not "horizontal space", horiz_space already includes \f, which is also not
>> horizontal IMO.  I think horiz_space is really all space characters except
>> newline characters.  Maybe this should be rephrased.

> And a few lines above, there is a comment from 2000 (3cfdd8f)
> pondering if \f should be handled as a newline, which is kind of
> incorrect anyway?

It looks to me like there are two places where these distinctions
actually matter:

1. Which characters terminate a "--" comment.  Currently that's only
[\n\r] (see {non_newline}).

2. Which characters satisfy the SQL spec's requirement that there be a
newline in the whitespace separating string literals that are to be
concatenated.  Currently, that's also only [\n\r].

Assuming we don't want to change either of these distinctions,
the v2 patch looks about right to me.

regards, tom lane




Re: Consider \v to the list of whitespace characters in the parser

2023-07-03 Thread Michael Paquier
On Mon, Jul 03, 2023 at 12:17:10PM +0200, Peter Eisentraut wrote:
> SQL has "whitespace", which includes any Unicode character with the
> White_Space property (which includes \v), and , which is
> implementation-defined.
> 
> So nothing there speaks against treating \v as a (white)space character in
> the SQL scanner.

Okay, thanks for confirming.  

> In scan.l, you might want to ponder horiz_space: Even though \v is clearly
> not "horizontal space", horiz_space already includes \f, which is also not
> horizontal IMO.  I think horiz_space is really all space characters except
> newline characters.  Maybe this should be rephrased.

And a few lines above, there is a comment from 2000 (3cfdd8f)
pondering if \f should be handled as a newline, which is kind of
incorrect anyway?

FWIW, I agree that horiz_space is confusing in this context because it
does not completely reflect the reality, and \v is not that so adding
it to the existing list felt wrong to me.  Form feed is also not a
newline, from what I understand..  From what the parser tells, there
are two things we want to track to handle comments:
- All space characters, which would be \t\n\r\f\v.
- All space characters that are not newlines, \t\f\v.

I don't really have a better idea this morning than using the
following terms in the parser, changing the surroundings with similar
terms:
-space  [ \t\n\r\f]
-horiz_space[ \t\f]
+space  [ \t\n\r\f\v]
+non_newline_space  [ \t\f\v]

Perhaps somebody has a better idea of split?
--
Michael
diff --git a/src/backend/parser/parse_type.c b/src/backend/parser/parse_type.c
index be75dc6ab0..63b4e96962 100644
--- a/src/backend/parser/parse_type.c
+++ b/src/backend/parser/parse_type.c
@@ -742,7 +742,7 @@ typeStringToTypeName(const char *str, Node *escontext)
 	ErrorContextCallback ptserrcontext;
 
 	/* make sure we give useful error for empty input */
-	if (strspn(str, " \t\n\r\f") == strlen(str))
+	if (strspn(str, " \t\n\r\f\v") == strlen(str))
 		goto fail;
 
 	/*
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index b2216a9eac..0708ba6540 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -213,16 +213,16 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  * versions of Postgres failed to recognize -- as a comment if the input
  * did not end with a newline.
  *
- * XXX perhaps \f (formfeed) should be treated as a newline as well?
+ * non_newline_space tracks all the other space characters except newlines.
  *
  * XXX if you change the set of whitespace characters, fix scanner_isspace()
  * to agree.
  */
 
-space			[ \t\n\r\f]
-horiz_space		[ \t\f]
-newline			[\n\r]
-non_newline		[^\n\r]
+space[ \t\n\r\f\v]
+non_newline_space	[ \t\f\v]
+newline[\n\r]
+non_newline			[^\n\r]
 
 comment			("--"{non_newline}*)
 
@@ -236,8 +236,8 @@ whitespace		({space}+|{comment})
  */
 
 special_whitespace		({space}+|{comment}{newline})
-horiz_whitespace		({horiz_space}|{comment})
-whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
+non_newline_whitespace	({non_newline_space}|{comment})
+whitespace_with_newline	({non_newline_whitespace}*{newline}{special_whitespace}*)
 
 quote			'
 /* If we see {quote} then {quotecontinue}, the quoted string continues */
@@ -1414,6 +1414,8 @@ unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 			return '\r';
 		case 't':
 			return '\t';
+		case 'v':
+			return '\v';
 		default:
 			/* check for backslash followed by non-7-bit-ASCII */
 			if (c == '\0' || IS_HIGHBIT_SET(c))
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index ed67f5f5fe..4f0005a114 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -121,6 +121,7 @@ scanner_isspace(char ch)
 		ch == '\t' ||
 		ch == '\n' ||
 		ch == '\r' ||
+		ch == '\v' ||
 		ch == '\f')
 		return true;
 	return false;
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..1cc7fb858c 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -73,7 +73,7 @@ static void addlitchar(unsigned char ychar);
 %x xd
 %x xq
 
-space			[ \t\n\r\f]
+space			[ \t\n\r\f\v]
 
 quote			'
 quotestop		{quote}
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 9000f83a83..4359dbd83d 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -24,6 +24,7 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
+#include "parser/scansup.h"
 #include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/arrayaccess.h"
@@ -89,7 +90,6 @@ typedef struct ArrayIteratorData
 	int			current_item;	/* the item # we're at in the array */
 }			ArrayIteratorData;
 
-static bool array_isspace(char ch);
 static int	ArrayCount(const char *str, int *dim, char typdelim,
 	   Node *escontext);
 

Re: Consider \v to the list of whitespace characters in the parser

2023-07-03 Thread Peter Eisentraut

On 21.06.23 08:45, Michael Paquier wrote:

One thing I was wondering: has the SQL specification anything specific
about the way vertical tabs should be parsed?


SQL has "whitespace", which includes any Unicode character with the 
White_Space property (which includes \v), and , which is 
implementation-defined.


So nothing there speaks against treating \v as a (white)space character 
in the SQL scanner.


In scan.l, you might want to ponder horiz_space: Even though \v is 
clearly not "horizontal space", horiz_space already includes \f, which 
is also not horizontal IMO.  I think horiz_space is really all space 
characters except newline characters.  Maybe this should be rephrased.






Consider \v to the list of whitespace characters in the parser

2023-06-21 Thread Michael Paquier
Hi all,
(Adding Evan in CC as he has reported the original issue with hstore.)

$subject has showed up as a subject for discussion when looking at the
set of whitespace characters that we use in the parsers:
https://www.postgresql.org/message-id/ca+hwa9btrdf52dhyu+jooqealgrgro5uhuytfuduoj3cbfe...@mail.gmail.com

On HEAD, these are \t, \n, \r and \f which is consistent with the list
that we use in scanner_isspace().  

This has quite some history, first in 9ae2661 that dealt with an old
issue with BSD's isspace where whitespaces may not be detected
correctly.  hstore has been recently changed to fix the same problem
with d522b05, still depending on scanner_isspace() for the job makes
the handling of \v kind of strange.

That's not the end of the story.  There is an inconsistency with the
way array values are handled for the same problem, where 95cacd1 added
handling for \v in the list of what's considered a whitespace.

Attached is a patch to bring a bit more consistency across the board,
by adding \v to the set of characters that are considered as
whitespace by the parser.  Here are a few things that I have noticed
in passing:
- JSON should not escape \v, as defined in RFC 7159.
- syncrep_scanner.l already considered \v as a whitespace.  Its
neighbor repl_scanner.l did not do that.
- There are a few more copies that would need a refresh of what is
considered as a whitespace in their respective lex scanners:
psqlscan.l, psqlscanslash.l, cubescan.l, segscan.l, ECPG's pgc.l.

One thing I was wondering: has the SQL specification anything specific
about the way vertical tabs should be parsed?

Thoughts and comments are welcome.
Thanks,
--
Michael
diff --git a/src/backend/parser/parse_type.c b/src/backend/parser/parse_type.c
index be75dc6ab0..63b4e96962 100644
--- a/src/backend/parser/parse_type.c
+++ b/src/backend/parser/parse_type.c
@@ -742,7 +742,7 @@ typeStringToTypeName(const char *str, Node *escontext)
 	ErrorContextCallback ptserrcontext;
 
 	/* make sure we give useful error for empty input */
-	if (strspn(str, " \t\n\r\f") == strlen(str))
+	if (strspn(str, " \t\n\r\f\v") == strlen(str))
 		goto fail;
 
 	/*
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index b2216a9eac..a6c67c7151 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -219,7 +219,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  * to agree.
  */
 
-space			[ \t\n\r\f]
+space			[ \t\n\r\f\v]
 horiz_space		[ \t\f]
 newline			[\n\r]
 non_newline		[^\n\r]
@@ -1414,6 +1414,8 @@ unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 			return '\r';
 		case 't':
 			return '\t';
+		case 'v':
+			return '\v';
 		default:
 			/* check for backslash followed by non-7-bit-ASCII */
 			if (c == '\0' || IS_HIGHBIT_SET(c))
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index ed67f5f5fe..4f0005a114 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -121,6 +121,7 @@ scanner_isspace(char ch)
 		ch == '\t' ||
 		ch == '\n' ||
 		ch == '\r' ||
+		ch == '\v' ||
 		ch == '\f')
 		return true;
 	return false;
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..1cc7fb858c 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -73,7 +73,7 @@ static void addlitchar(unsigned char ychar);
 %x xd
 %x xq
 
-space			[ \t\n\r\f]
+space			[ \t\n\r\f\v]
 
 quote			'
 quotestop		{quote}
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 9000f83a83..4359dbd83d 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -24,6 +24,7 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
+#include "parser/scansup.h"
 #include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/arrayaccess.h"
@@ -89,7 +90,6 @@ typedef struct ArrayIteratorData
 	int			current_item;	/* the item # we're at in the array */
 }			ArrayIteratorData;
 
-static bool array_isspace(char ch);
 static int	ArrayCount(const char *str, int *dim, char typdelim,
 	   Node *escontext);
 static bool ReadArrayStr(char *arrayStr, const char *origStr,
@@ -254,7 +254,7 @@ array_in(PG_FUNCTION_ARGS)
 		 * Note: we currently allow whitespace between, but not within,
 		 * dimension items.
 		 */
-		while (array_isspace(*p))
+		while (scanner_isspace(*p))
 			p++;
 		if (*p != '[')
 			break;/* no more dimension items */
@@ -338,7 +338,7 @@ array_in(PG_FUNCTION_ARGS)
 	 errdetail("Missing \"%s\" after array dimensions.",
 			   ASSGN)));
 		p += strlen(ASSGN);
-		while (array_isspace(*p))
+		while (scanner_isspace(*p))
 			p++;
 
 		/*
@@ -434,27 +434,6 @@ array_in(PG_FUNCTION_ARGS)
 	PG_RETURN_ARRAYTYPE_P(retval);
 }
 
-/*
- * array_isspace() --- a non-locale-dependent isspace()
- *
- * We used to use isspace() for parsing array values, but that