Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-09 Thread Jacob Champion
On Wed, May 8, 2024 at 9:27 PM Michael Paquier wrote: > This is a bit mitigated by the fact that d6607016c738 is recent, but > this has been incorrect since v13, so backpatched down to that. Thank you! --Jacob

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-08 Thread Michael Paquier
On Wed, May 08, 2024 at 07:01:08AM -0700, Jacob Champion wrote: > On Tue, May 7, 2024 at 10:31 PM Michael Paquier wrote: >> But looking closer, I can see that in the JSON_INVALID_TOKEN case, >> when !tok_done, we set token_terminator to point to the end of the >> token, and that would include an

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-08 Thread Jacob Champion
On Tue, May 7, 2024 at 10:31 PM Michael Paquier wrote: > But looking closer, I can see that in the JSON_INVALID_TOKEN case, > when !tok_done, we set token_terminator to point to the end of the > token, and that would include an incomplete byte sequence like in your > case. :/ Ah, I see what

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-07 Thread Michael Paquier
On Tue, May 07, 2024 at 02:06:10PM -0700, Jacob Champion wrote: > Maybe I've misunderstood, but isn't that what's being done in v2? Something a bit different. I was wondering whether this code could be tweaked to truncate the data in the generated error string so that the incomplete

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-07 Thread Jacob Champion
On Mon, May 6, 2024 at 8:43 PM Michael Paquier wrote: > On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote: > > We could port something like that to src/common. IMO that'd be more > > suited for an actual conversion routine, though, as opposed to a > > parser that for the most part

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-06 Thread Michael Paquier
On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote: > On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut wrote: >> but for the general encoding conversion we have what >> would appear to be the same behavior in report_invalid_encoding(), and >> we go out of our way there to produce a

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-03 Thread Jacob Champion
On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut wrote: > > On 30.04.24 19:39, Jacob Champion wrote: > > Tangentially: Should we maybe rethink pieces of the json_lex_string > > error handling? For example, do we really want to echo an incomplete > > multibyte sequence once we know it's bad? > > I

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-03 Thread Peter Eisentraut
On 30.04.24 19:39, Jacob Champion wrote: Tangentially: Should we maybe rethink pieces of the json_lex_string error handling? For example, do we really want to echo an incomplete multibyte sequence once we know it's bad? I can't quite find the place you might be looking at in

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-02 Thread Jacob Champion
On Wed, May 1, 2024 at 8:40 PM Michael Paquier wrote: > > On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote: > > About the fact that we may finish by printing unfinished UTF-8 > > sequences, I'd be curious to hear your thoughts. Now, the information > > provided about the partial

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-01 Thread Michael Paquier
On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote: > About the fact that we may finish by printing unfinished UTF-8 > sequences, I'd be curious to hear your thoughts. Now, the information > provided about the partial byte sequences can be also useful for > debugging on top of having

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-01 Thread Michael Paquier
On Wed, May 01, 2024 at 04:22:24PM -0700, Jacob Champion wrote: > On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier wrote: >> Not sure I like the fact that this advances token_terminator >> first. Wouldn't it be better to calculate pg_encoding_mblen() first, >> then save token_terminator?

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-01 Thread Jacob Champion
On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier wrote: > Not sure I like the fact that this advances token_terminator > first. Wouldn't it be better to calculate pg_encoding_mblen() first, > then save token_terminator? I feel a bit uneasy about saving a value > in token_terminator past

Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-01 Thread Michael Paquier
On Tue, Apr 30, 2024 at 10:39:04AM -0700, Jacob Champion wrote: > When json_lex_string() hits certain types of invalid input, it calls > pg_encoding_mblen_bounded(), which assumes that its input is > null-terminated and calls strnlen(). But the JSON lexer is constructed > with an explicit string

[PATCH] json_lex_string: don't overread on bad UTF8

2024-04-30 Thread Jacob Champion
Hi all, When json_lex_string() hits certain types of invalid input, it calls pg_encoding_mblen_bounded(), which assumes that its input is null-terminated and calls strnlen(). But the JSON lexer is constructed with an explicit string length, and we don't ensure that the string is null-terminated
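
The overread described in the original post can be illustrated with a small standalone sketch. This is not the PostgreSQL code or the patch itself: utf8_seq_len() and utf8_seq_len_bounded() are hypothetical names invented here, and the sketch assumes UTF-8 only. The point it shows is that, when the caller knows the buffer length explicitly, the sequence length should be clamped against an end pointer rather than by scanning for a terminating NUL with strnlen().

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a PostgreSQL function): return the number of
 * bytes a UTF-8 sequence claims to occupy, judging only by its lead byte.
 */
static size_t
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: treat as one byte */
}

/*
 * Hypothetical bounded variant: clamp the claimed length to the bytes
 * actually available in [s, end). Only *s is read, and only when s < end,
 * so the buffer does not need to be NUL-terminated.
 */
static size_t
utf8_seq_len_bounded(const char *s, const char *end)
{
    size_t      remaining = (size_t) (end - s);
    size_t      len;

    if (remaining == 0)
        return 0;               /* nothing left to examine */

    len = utf8_seq_len((unsigned char) *s);
    return (len < remaining) ? len : remaining;
}
```

With a three-byte buffer holding 'a' followed by the first two bytes of a three-byte sequence and no terminator, the bounded variant reports 2 for the truncated sequence instead of reading past the end of the buffer.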