Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Tue, Oct 10, 2023 at 10:51:10AM -0400, Evan Jones wrote: > Here is a quick demonstration of this issue, showing that the quoting > behavior is different between these two. Mac OS X with the "default" locale > includes quotes because ą includes 0x85 in its UTF-8 encoding: Ugh. rowtypes.c has reminded me as well of gistfuncs.c in pageinspect where included columns are printed in a ROW-like fashion. And it also uses isspace() when we check if double quotes are needed or not. So the use of the quotes would equally depend on what macos thinks is a correct space in this case. -- Michael signature.asc Description: PGP signature
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Thanks for bringing this up! I just looked at the uses if isspace() in that file. It looks like it is the usual thing: it is allowing leading or trailing whitespace when parsing values, or for this "needs quoting" logic on output. The fix would be the same: this *should* be using scanner_isspace. This has the same disadvantage: it would change Postgres's results for some inputs that contain these non-ASCII "space" characters. Here is a quick demonstration of this issue, showing that the quoting behavior is different between these two. Mac OS X with the "default" locale includes quotes because ą includes 0x85 in its UTF-8 encoding: postgres=# SELECT ROW('keyą'); row -- ("keyą") (1 row) On Mac OS X with the LANG=C environment variable set, it does not include quotes: postgres=# SELECT ROW('keyą'); row (keyą) (1 row) On Mon, Oct 9, 2023 at 11:18 PM Thomas Munro wrote: > FTR I ran into a benign case of the phenomenon in this thread when > dealing with row types. In rowtypes.c, we double-quote stuff > containing spaces, but we detect them by passing individual bytes of > UTF-8 sequences to isspace(). Like macOS, Windows thinks that 0xa0 is > a space when you do that, so for example the Korean character '점' > (code point C810, UTF-8 sequence EC A0 90) gets quotes on Windows but > not on Linux. That confused a migration/diff tool while comparing > Windows and Linux database servers using that representation. Not a > big deal, I guess no one ever promised that the format was stable > across platforms, and I don't immediately see a way for anything more > serious to go wrong (though I may lack imagination). It does seem a > bit weird to be using locale-aware tokenising for a machine-readable > format, and then making sure its behaviour is undefined by feeding it > chopped up bytes. >
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
FTR I ran into a benign case of the phenomenon in this thread when dealing with row types. In rowtypes.c, we double-quote stuff containing spaces, but we detect them by passing individual bytes of UTF-8 sequences to isspace(). Like macOS, Windows thinks that 0xa0 is a space when you do that, so for example the Korean character '점' (code point C810, UTF-8 sequence EC A0 90) gets quotes on Windows but not on Linux. That confused a migration/diff tool while comparing Windows and Linux database servers using that representation. Not a big deal, I guess no one ever promised that the format was stable across platforms, and I don't immediately see a way for anything more serious to go wrong (though I may lack imagination). It does seem a bit weird to be using locale-aware tokenising for a machine-readable format, and then making sure its behaviour is undefined by feeding it chopped up bytes.
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Tue, Jun 20, 2023 at 11:39:31PM -0400, Tom Lane wrote: > I'd be okay with adding \v to the set of whitespace characters in > scan.l and scanner_isspace (and other affected places) for v17. > Don't want to back-patch it though. Okay. No idea where this will lead, but for now I have sent a patch that adds \v to the parser paths where it would be needed, as far as I checked: https://www.postgresql.org/message-id/zjkcjnwwhhvw9...@paquier.xyz -- Michael signature.asc Description: PGP signature
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Michael Paquier writes: > As a whole, I'd like to think that this is an improvement even for > stable branches with these weird isspace() handlings, so I'm OK with > the current status in all the branches. Sounds like we're all content with that. > There's an argument about \v, > IMO, but I won't fight hard for it either even if it would be more > consistent with the way array values are handled. I'd be okay with adding \v to the set of whitespace characters in scan.l and scanner_isspace (and other affected places) for v17. Don't want to back-patch it though. regards, tom lane
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Tue, Jun 20, 2023 at 09:04:26AM -0400, Evan Jones wrote: > My one last suggestion: We *could* revert the backpatching if we are > concerned about this change, but I'm not personally sure that is necessary. > As we discussed, this is an unusual corner case in an "extension" type that > many won't even have enabled. As a whole, I'd like to think that this is an improvement even for stable branches with these weird isspace() handlings, so I'm OK with the current status in all the branches. There's an argument about \v, IMO, but I won't fight hard for it either even if it would be more consistent with the way array values are handled. -- Michael signature.asc Description: PGP signature
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Thanks for the detailed discussion. To confirm that I've understood everything: * Michael's proposed patch to add hstore_isspace() would be a potential fix: it resolves my original bug, and does not change the behavior of '\v'. * We believe the change to '\v' is not a problem, and may be an improvement because it now follows the "core" Postgres parser. In conclusion: we don't need to make an additional change. Thank you all for investigating! My one last suggestion: We *could* revert the backpatching if we are concerned about this change, but I'm not personally sure that is necessary. As we discussed, this is an unusual corner case in an "extension" type that many won't even have enabled. Evan On Tue, Jun 20, 2023 at 2:02 AM Michael Paquier wrote: > On Sun, Jun 18, 2023 at 09:10:59PM -0400, Tom Lane wrote: > > What have you got in mind? We should already have validated encoding > > correctness before the text ever gets to hstore_in, and I'm not clear > > what additional checks would be useful. > > I was staring at the hstore parsing code and got the impression that > multi-byte character handling could be improved, but looking closer it > seems that I got that wrong. Apologies for the noise. > -- > Michael >
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Sun, Jun 18, 2023 at 09:10:59PM -0400, Tom Lane wrote: > What have you got in mind? We should already have validated encoding > correctness before the text ever gets to hstore_in, and I'm not clear > what additional checks would be useful. I was staring at the hstore parsing code and got the impression that multi-byte character handling could be improved, but looking closer it seems that I got that wrong. Apologies for the noise. -- Michael signature.asc Description: PGP signature
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Michael Paquier writes: > Another thing that I was wondering, though.. Do you think that there > would be an argument in being stricter in the hstore code regarding > the handling of multi-byte characters with some checks based on > IS_HIGHBIT_SET() when parsing the keys and values? What have you got in mind? We should already have validated encoding correctness before the text ever gets to hstore_in, and I'm not clear what additional checks would be useful. regards, tom lane
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Sun, Jun 18, 2023 at 12:38:12PM -0400, Tom Lane wrote: > FWIW, I think the status quo is fine. Having hstore do something that > is neither its historical behavior nor aligned with the core parser > doesn't seem like a great idea. Okay. Fine by me. > I don't buy this argument that > somebody might be depending on the handling of \v in particular. It's > not any stronger than the argument that they might be depending on, > say, recognizing no-break space (0xA0) in LATIN1, which the old code > did (probably, depending on platform) and scanner_isspace will not. Another thing that I was wondering, though.. Do you think that there would be an argument in being stricter in the hstore code regarding the handling of multi-byte characters with some checks based on IS_HIGHBIT_SET() when parsing the keys and values? -- Michael signature.asc Description: PGP signature
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Michael Paquier writes: > At the end, no need to do that. I have been able to hack the > attached, that shows the difference of treatment for \v when running > in macOS. Evan, what do you think? FWIW, I think the status quo is fine. Having hstore do something that is neither its historical behavior nor aligned with the core parser doesn't seem like a great idea. I don't buy this argument that somebody might be depending on the handling of \v in particular. It's not any stronger than the argument that they might be depending on, say, recognizing no-break space (0xA0) in LATIN1, which the old code did (probably, depending on platform) and scanner_isspace will not. If anything, the answer for these concerns is that d522b05c8 should not have been back-patched. But I'm okay with where we are. regards, tom lane
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Sun, Jun 18, 2023 at 10:50:16AM +0900, Michael Paquier wrote: > The difference between scanner_isspace() and array_isspace() is that > the former matches with what scan.l stores as rules for whitespace > characters, but the latter works on values. For hstore, we want the > latter, with something that works on values. To keep the change > locale to hstore, I think that we should just introduce an > hstore_isspace() which is a copy of array_isspace. That's a > duplication, sure, but I think that we may want to think harder about > \v in the flex scanner, and that's just a few extra lines for > something that has not changed in 13 years for arrays. That's also > easier to think about for stable branches. If you can send a patch, > that helps a lot, for sure! At the end, no need to do that. I have been able to hack the attached, that shows the difference of treatment for \v when running in macOS. Evan, what do you think? -- Michael From 7ede07940011d08f8c0cf05d57b3782f367c0adf Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Sun, 18 Jun 2023 17:29:37 +0900 Subject: [PATCH] More fixes for hstore and locales This is not really complete without the treatment of \v like array values. --- contrib/hstore/expected/hstore_utf8.out | 31 + contrib/hstore/hstore_io.c | 30 contrib/hstore/sql/hstore_utf8.sql | 7 ++ 3 files changed, 63 insertions(+), 5 deletions(-) diff --git a/contrib/hstore/expected/hstore_utf8.out b/contrib/hstore/expected/hstore_utf8.out index 4405824413..bbc885a181 100644 --- a/contrib/hstore/expected/hstore_utf8.out +++ b/contrib/hstore/expected/hstore_utf8.out @@ -34,3 +34,34 @@ SELECT 'keyąfoo=>valueą'::hstore; "keyąfoo"=>"valueą" (1 row) +-- More patterns that may depend on isspace() and locales, all discarded. +SELECT E'key\u000A=>value\u000A'::hstore; -- \n + hstore + + "key"=>"value" +(1 row) + +SELECT E'key\u0009=>value\u0009'::hstore; -- \t + hstore + + "key"=>"value" +(1 row) + +SELECT E'key\u000D=>value\u000D'::hstore; -- \r + hstore + + "key"=>"value" +(1 row) + +SELECT E'key\u000B=>value\u000B'::hstore; -- \v + hstore + + "key"=>"value" +(1 row) + +SELECT E'key\u000C=>value\u000C'::hstore; -- \f + hstore + + "key"=>"value" +(1 row) + diff --git a/contrib/hstore/hstore_io.c b/contrib/hstore/hstore_io.c index 999ddad76d..f33b241d54 100644 --- a/contrib/hstore/hstore_io.c +++ b/contrib/hstore/hstore_io.c @@ -13,7 +13,6 @@ #include "lib/stringinfo.h" #include "libpq/pqformat.h" #include "nodes/miscnodes.h" -#include "parser/scansup.h" #include "utils/builtins.h" #include "utils/json.h" #include "utils/jsonb.h" @@ -41,6 +40,7 @@ typedef struct int plen; } HSParser; +static bool hstore_isspace(char ch); static bool hstoreCheckKeyLength(size_t len, HSParser *state); static bool hstoreCheckValLength(size_t len, HSParser *state); @@ -82,6 +82,26 @@ prseof(HSParser *state) return false; } +/* + * hstore_isspace() --- a non-locale-dependent isspace() + * + * We used to use isspace() for parsing hstore values, but that has + * undesirable results: a hstore value might be silently interpreted + * differently depending on the locale setting. Now we just hard-wire + * the traditional ASCII definition of isspace(). + */ +static bool +hstore_isspace(char ch) +{ + if (ch == ' ' || + ch == '\t' || + ch == '\n' || + ch == '\r' || + ch == '\v' || + ch == '\f') + return true; + return false; +} #define GV_WAITVAL 0 #define GV_INVAL 1 @@ -119,7 +139,7 @@ get_val(HSParser *state, bool ignoreeq, bool *escaped) { st = GV_WAITESCIN; } - else if (!scanner_isspace((unsigned char) *(state->ptr))) + else if (!hstore_isspace((unsigned char) *(state->ptr))) { *(state->cur) = *(state->ptr); state->cur++; @@ -142,7 +162,7 @@ get_val(HSParser *state, bool ignoreeq, bool *escaped) state->ptr--; return true; } - else if (scanner_isspace((unsigned char) *(state->ptr))) + else if (hstore_isspace((unsigned char) *(state->ptr))) { return true; } @@ -256,7 +276,7 @@ parse_hstore(HSParser *state) { PRSEOF; } - else if (!scanner_isspace((unsigned char) *(state->ptr))) + else if (!hstore_isspace((unsigned char) *(state->ptr))) { PRSSYNTAXERROR; } @@ -310,7 +330,7 @@ parse_hstore(HSParser *state) { return true; } - else if (!scanner_isspace((unsigned char) *(state->ptr))) + else if (!hstore_isspace((unsigned char) *(state->ptr))) { PRSSYNTAXERROR; } diff --git a/contrib/hstore/sql/hstore_utf8.sql b/contrib/hstore/sql/hstore_utf8.sql index face878324..38c9481ee6 100644 --- a/contrib/hstore/sql/hstore_utf8.sql +++ b/contrib/hstore/sql/hstore_utf8.sql @@ -17,3 +17,10 @@ SELECT E'key\u0105=>value\u0105'::hstore; SELECT 'keyą=>valueą'::hstore; SELECT 'ą=>ą'::hstore;
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Sat, Jun 17, 2023 at 10:57:05AM -0400, Evan Jones wrote: > However, if we think this change could be a problem, one fix would be to > switch scanner_isspace() to array_isspace(), which returns true for these > *six* ASCII characters. I am happy to submit a patch to do this. The difference between scanner_isspace() and array_isspace() is that the former matches with what scan.l stores as rules for whitespace characters, but the latter works on values. For hstore, we want the latter, with something that works on values. To keep the change locale to hstore, I think that we should just introduce an hstore_isspace() which is a copy of array_isspace. That's a duplication, sure, but I think that we may want to think harder about \v in the flex scanner, and that's just a few extra lines for something that has not changed in 13 years for arrays. That's also easier to think about for stable branches. If you can send a patch, that helps a lot, for sure! Worth noting that the array part has been changed in 2010, with 95cacd1, for the same reason as what you've proposed for hstore. Thread is here, and it does not mention our flex rules, either: https://www.postgresql.org/message-id/8f72262c-5694-4626-a87f-00604fb5e...@trumpet.io Perhaps we could consider \v as a whitespace in the flex scanner itself, but I am scared to do that in any stable branch. Perhaps we could consider that for HEAD in 17~? That's a lot to work around an old BSD bug that macOS has inherited, though. -- Michael signature.asc Description: PGP signature
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Unfortunately I just noticed a possible "bug" with this change. The scanner_isspace() function only recognizes *five* ASCII space characters: ' ' \t \n \r \f. It *excludes* VTAB \v, which the C standard function isspace() includes. This means this patch changed the behavior of hstore parsing for some "unusual" cases where the \v character was previously ignored, and now is not, such as: "select 'k=>\vv'::hstore" . It seems unlikely to me that anyone would be depending on this. The application/programming language library would need to be explicitly depending on VTAB being ignored as leading/trailing characters for hstore key/values. I am hopeful that most implementations encode hstore values the same way Postgres does: always using quoted strings, which avoids this problem. However, if we think this change could be a problem, one fix would be to switch scanner_isspace() to array_isspace(), which returns true for these *six* ASCII characters. I am happy to submit a patch to do this. However, I am now wondering if the fact that scanner_isspace() and array_isspace() disagree with each other could be problematic somewhere, but so far I haven found anything. Problematic example before my hstore change: $ printf "select 'k=>\vv'::hstore" | psql hstore -- "k"=>"v" (1 row) Same example after my hstore change on postgres master commit a14e75eb0b from 2023-06-16: $ printf "select 'k=>\vv'::hstore" | psql hstore -- "k"=>"\x0Bv" (1 row) On Sun, Jun 11, 2023 at 8:18 PM Michael Paquier wrote: > On Tue, Jun 06, 2023 at 10:16:09AM -0400, Evan Jones wrote: > > I did a quick look at the places found with "git grep isspace" > yesterday. I > > agree with the comment from commit 9ae2661: "I've left alone isspace() > > calls in places that aren't really expecting any non-ASCII input > > characters, such as float8in()." There are a number of other calls where > I > > think it would likely be safe, and possibly even a good idea, to replace > > isspace() with scanner_isspace(). However, I couldn't find any where I > > could cause a bug like the one I hit in hstore parsing. > > Yes, I agree with this feeling. Like 9ae2661, I can't get really > excited about plastering more of that, especially if it were for > timezone value input or dictionary options. One area with a new > isspace() since 2017 is multirangetypes.c, but it is just a copy of > rangetypes.c. > > > Original mailing list post for commit 9ae2661 in case it is helpful for > > others: > https://www.postgresql.org/message-id/10129.1495302...@sss.pgh.pa.us > > I have reproduced the original problem reported on macOS 13.4, which > is close to the top of what's available. > > Passing to pg_regress some options to use something else than UTF-8 > leads to a failure in the tests, so we need a split like > fussyztrmatch to test that: > REGRESS_OPTS='--encoding=SQL_ASCII --no-locale' make check > > An other error pattern without a split could be found on Windows, as > of: > select E'key\u0105=>value'::hstore; > - hstore > -- > - "keyÄ…"=>"value" > -(1 row) > - > +ERROR: character with byte sequence 0xc4 0x85 in encoding "UTF8" has > no equivalent in encoding "WIN1252" > +LINE 1: select E'key\u0105=>value'::hstore; > > We don't do that for unaccent, actually, leading to similar failures.. > I'll launch a separate thread about that shortly. > > With that fixed, the fix has been applied and backpatched. Thanks for > the report, Evan! > -- > Michael >
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Tue, Jun 06, 2023 at 10:16:09AM -0400, Evan Jones wrote: > I did a quick look at the places found with "git grep isspace" yesterday. I > agree with the comment from commit 9ae2661: "I've left alone isspace() > calls in places that aren't really expecting any non-ASCII input > characters, such as float8in()." There are a number of other calls where I > think it would likely be safe, and possibly even a good idea, to replace > isspace() with scanner_isspace(). However, I couldn't find any where I > could cause a bug like the one I hit in hstore parsing. Yes, I agree with this feeling. Like 9ae2661, I can't get really excited about plastering more of that, especially if it were for timezone value input or dictionary options. One area with a new isspace() since 2017 is multirangetypes.c, but it is just a copy of rangetypes.c. > Original mailing list post for commit 9ae2661 in case it is helpful for > others: https://www.postgresql.org/message-id/10129.1495302...@sss.pgh.pa.us I have reproduced the original problem reported on macOS 13.4, which is close to the top of what's available. Passing to pg_regress some options to use something else than UTF-8 leads to a failure in the tests, so we need a split like fussyztrmatch to test that: REGRESS_OPTS='--encoding=SQL_ASCII --no-locale' make check An other error pattern without a split could be found on Windows, as of: select E'key\u0105=>value'::hstore; - hstore -- - "keyÄ…"=>"value" -(1 row) - +ERROR: character with byte sequence 0xc4 0x85 in encoding "UTF8" has no equivalent in encoding "WIN1252" +LINE 1: select E'key\u0105=>value'::hstore; We don't do that for unaccent, actually, leading to similar failures.. I'll launch a separate thread about that shortly. With that fixed, the fix has been applied and backpatched. Thanks for the report, Evan! -- Michael signature.asc Description: PGP signature
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Tue, Jun 6, 2023 at 7:37 AM Michael Paquier wrote: > Indeed. It looks like 9ae2661 missed this spot. > I didn't think to look for a previous fix, thanks for finding this commit id! I did a quick look at the places found with "git grep isspace" yesterday. I agree with the comment from commit 9ae2661: "I've left alone isspace() calls in places that aren't really expecting any non-ASCII input characters, such as float8in()." There are a number of other calls where I think it would likely be safe, and possibly even a good idea, to replace isspace() with scanner_isspace(). However, I couldn't find any where I could cause a bug like the one I hit in hstore parsing. Original mailing list post for commit 9ae2661 in case it is helpful for others: https://www.postgresql.org/message-id/10129.1495302...@sss.pgh.pa.us
Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
On Mon, Jun 05, 2023 at 11:26:56AM -0400, Evan Jones wrote: > This patch fixes a rare parsing bug with unicode characters on Mac OS X. > The problem is that isspace() on Mac OS X changes its behaviour with the > locale. Use scanner_isspace instead, which only returns true for ASCII > whitespace. It appears other places in the Postgres code have already run > into this, since a number of places use scanner_isspace instead. However, > there are still a lot of other calls to isspace(). I'll try to take a quick > look to see if there might be other instances of this bug. Indeed. It looks like 9ae2661 missed this spot. -- Michael signature.asc Description: PGP signature