In a message of Wednesday, 13 July 2022 16:14:39 MSK, Aleksander
Alekseev wrote:

Hi! Let me join the review process. Postgres data types are a field of
expertise I am interested in.

0. Though it looks like a steady bug, I can't reproduce it, neither with
Valgrind nor with ASan (AddressSanitizer should catch out-of-bounds reads
too). I am running Debian Bullseye, and tried to build both postgresql 14.4
and current master.

Nevertheless, I would like to dig into this issue, starting with the parts
that are not covered by the patch but seem important to me.

1. The type name "char" (with quotes) is very, very confusing. It is
described in the documentation, but you need to be a Postgres expert, or a
careful documentation reader, to notice the important difference between
"char" and char.
What is the history of the "char" type? Is it specified by some standard?
Maybe it would be a good idea to create a more understandable alias, like
byte_char or ascii_char, for practical use, and keep "char" for backward
compatibility only.

2. I totally agree with Tom Lane and Isaac Morland that the problem should
also be fixed on the type conversion side. There is a whole big thread about
it; I guess we should come to some conclusion there.

3. Fixing the out-of-bounds read for broken Unicode is also important,
though for now I am not quite sure it is possible.
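One possible direction, only as a sketch: validate the input before any
length arithmetic happens. I am assuming the conversion site has the string
pointer and its length at hand (s and slen below are made-up names);
pg_verifymbstr() with noError = true reports the verdict instead of raising
an error:

    #include "postgres.h"
    #include "mb/pg_wchar.h"

    /*
     * Hypothetical guard (s/slen are assumed names): verify the whole
     * string against the database encoding up front, so the truncation
     * code never has to deal with a dangling lead byte.
     */
    if (!pg_verifymbstr(s, slen, true))
        ereport(WARNING,
                (errmsg("input contains an invalid multibyte sequence")));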


> -                     p += pg_mblen(p);
> +             {
> +                     int t = pg_mblen(p);
> +                     p += t;
> +                     max_copy_bytes -= t;
> +             }

Minor issue: here I would change the variable name from "t" to "char_len" or
something similar, to make the code easier to understand.
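Something like this (same logic, only the name changed):

    {
        int     char_len = pg_mblen(p);

        p += char_len;
        max_copy_bytes -= char_len;
    }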

Major issue: is the pg_mblen function safe to call with broken encoding at
the end of the buffer? What if the last byte of the buffer is 0xF0 and you
call pg_mblen on it?
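For UTF-8, pg_mblen() decides the length from the lead byte alone, so a
truncated sequence at the end of the buffer still reports its full length.
Here is a minimal standalone sketch (mimicking that dispatch, not the actual
Postgres code) showing why the caller has to clamp:

    #include <stdio.h>

    /* Mimics the lead-byte dispatch of pg_utf_mblen(): the length is
     * derived from the first byte only, regardless of how many bytes
     * actually remain in the buffer. */
    static int
    utf8_seq_len(const unsigned char *s)
    {
        if ((*s & 0x80) == 0)
            return 1;
        if ((*s & 0xE0) == 0xC0)
            return 2;
        if ((*s & 0xF0) == 0xE0)
            return 3;
        if ((*s & 0xF8) == 0xF0)
            return 4;
        return 1;               /* invalid lead byte */
    }

    int
    main(void)
    {
        /* buffer whose very last byte is the 4-byte lead byte 0xF0 */
        unsigned char buf[] = { 'a', 'b', 0xF0 };
        size_t      off = 2;    /* position of the 0xF0 byte */
        size_t      left = sizeof(buf) - off;

        int         len = utf8_seq_len(buf + off);

        printf("reported length: %d, bytes actually left: %zu\n",
               len, left);

        /* a caller that trusts 'len' blindly reads past the buffer;
         * it must clamp to the bytes that really remain */
        size_t      safe = ((size_t) len < left) ? (size_t) len : left;

        printf("clamped copy length: %zu\n", safe);
        return 0;
    }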


>+              copy_bytes = p - s;
>+              if(copy_bytes > max_copy_bytes)
>+                      copy_bytes = max_copy_bytes;

Here I would suggest adding a comment about the broken UTF encoding case. It
would explain why we might end up in a situation where we try to copy more
bytes than we actually have.

I would also suggest issuing a warning here. I guess a person who uses
Postgres would prefer to learn that they managed to stuff a string with
broken UTF encoding into Postgres before it leads to some terrible
consequences.
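Putting both suggestions together, the hunk could look roughly like this
(just a sketch; the exact wording, and whether WARNING is the right level,
are open questions):

    /*
     * With invalid multibyte input, pg_mblen() may report a sequence
     * length that extends past the end of the string, so p can land
     * beyond the last byte we are allowed to copy.  Clamp, and tell
     * the user the input was not valid in the current encoding.
     */
    copy_bytes = p - s;
    if (copy_bytes > max_copy_bytes)
    {
        ereport(WARNING,
                (errmsg("input string contains an invalid multibyte sequence, truncating")));
        copy_bytes = max_copy_bytes;
    }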

> Hi Spyridon,
> 
> > The column "single_byte_col" is supposed to store only 1 byte.
> > Nevertheless, the INSERT command implicitly casts the '🀆' text into
> > "char". This means that only the first byte of '🀆' ends up stored in the
> > column. gdb reports that "pg_mblen(p) = 4" (line 1046), which is expected
> > since the pg_mblen('🀆') is indeed 4. Later at line 1050, the memcpy will
> > copy 4 bytes instead of 1, hence an out of bounds memory read happens for
> > pointer 's', which effectively copies random bytes.
> Many thanks for reporting this!
> 
> > - OS: Ubuntu 20.04
> > - PSQL version 14.4
> 
> I can confirm the bug exists in the `master` branch as well and
> doesn't depend on the platform.
> 
> Although the bug is easy to fix for this particular case (see the
> patch) I'm not sure if this solution is general enough. E.g. is there
> something that generally prevents pg_mblen() from doing out of bound
> reading in cases similar to this one? Should we prevent such an INSERT
> from happening instead?


-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su

