В письме от среда, 13 июля 2022 г. 16:14:39 MSK пользователь Aleksander Alekseev написал:
Hi! Let me join the review process. Postgres data types is field of expertise I am interested in. 0. Though it looks like a steady bug, I can't reproduce it. Not using valgrind, not using ASan (address sanitizer should catch reading out of bounds too). I am running Debian Bullseye, and tried to build both postgresl 14.4 and current master. Never the less I would dig into this issue. And start with the parts that is not covered by the patch, but seems to be important for me. 1. typename "char" (with quotes) is very-very-very confusing. it is described in documentation, but you need to be postgres expert or careful documentation reader, to notice important difference between "char" and char. What is the history if "char" type? Is it specified by some standard? May be it is good point to create more understandable alias, like byte_char, ascii_char or something for usage in practice, and keep "char" for backward compatibility only. 2. I would totally agree with Tom Lane and Isaac Morland, that problem should be also fixed on the side of type conversion. There is whole big thread about it. Guess we should come to some conclusion there 3.Fixing out of bound reading for broken unicode is also important. Though for now I am not quite sure it is possible. > - p += pg_mblen(p); > + { > + int t = pg_mblen(p); > + p += t; > + max_copy_bytes -= t; > + } Minor issue: Here I would change variable name from "t" to "char_len" or something, to make code more easy to understand. Major issue: is pg_mblen function safe to call with broken encoding at the end of buffer? What if last byte of the buffer is 0xF0 and you call pg_mblen for it? >+ copy_bytes = p - s; >+ if(copy_bytes > max_copy_bytes) >+ copy_bytes = max_copy_bytes; Here I would suggest to add comment about broken utf encoding case. That would explain why we might come to situation when we can try to copy more than we have. I would also suggest to issue a warning here. I guess person that uses postgres would prefer to know that he managed to stuff into postgres a string with broken utf encoding, before it comes to some terrible consequences. > Hi Spyridon, > > > The column "single_byte_col" is supposed to store only 1 byte. > > Nevertheless, the INSERT command implicitly casts the '🀆' text into > > "char". This means that only the first byte of '🀆' ends up stored in the > > column. gdb reports that "pg_mblen(p) = 4" (line 1046), which is expected > > since the pg_mblen('🀆') is indeed 4. Later at line 1050, the memcpy will > > copy 4 bytes instead of 1, hence an out of bounds memory read happens for > > pointer 's', which effectively copies random bytes. > Many thanks for reporting this! > > > - OS: Ubuntu 20.04 > > - PSQL version 14.4 > > I can confirm the bug exists in the `master` branch as well and > doesn't depend on the platform. > > Although the bug is easy to fix for this particular case (see the > patch) I'm not sure if this solution is general enough. E.g. is there > something that generally prevents pg_mblen() from doing out of bound > reading in cases similar to this one? Should we prevent such an INSERT > from happening instead? -- Nikolay Shaplov aka Nataraj Fuzzing Engineer at Postgres Professional Matrix IM: @dhyan:nataraj.su
signature.asc
Description: This is a digitally signed message part.