Body is valid, can be selected in full and the the body can be also fed to other functions, including version of substring using regexes.
badutf8=# select ctid, id, length(body), length(normalize(body)), body=normalize(body), octet_length(body), octet_length(normalize(body)) from bademail; ctid │ id │ length │ length │ ?column? │ octet_length │ octet_length ───────┼────┼────────┼────────┼──────────┼──────────────┼────────────── (0,1) │ 1 │ 2314 │ 2314 │ t │ 2323 │ 2323 (1 row) The most confusing thing is that the byte it complains about in case of substring(body, 1, 3) or substring(body, 1, 4) does not seem to be present in the original string at all and definitely not within the first few characters I ams asking to extract badutf8=# select ctid, id, substring(body, 1, 2) from bademail; ctid │ id │ substring ───────┼────┼─────────── (0,1) │ 1 │ Hi (1 row) Time: 0.527 ms badutf8=# select ctid, id, substring(body, 1, 3) from bademail; ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc3 LOCATION: report_invalid_encoding_int, mbutils.c:1847 Time: 0.638 ms badutf8=# select ctid, id, substring(body, 1, 4) from bademail; ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2 LOCATION: report_invalid_encoding_int, mbutils.c:1847 Time: 0.555 ms badutf8=# select ctid, id, substring(body, 1, 5) from bademail; ctid │ id │ substring ───────┼────┼─────────── (0,1) │ 1 │ Hi ev (1 row) On Fri, May 29, 2026 at 3:29 PM Heikki Linnakangas <[email protected]> wrote: > > On 29/05/2026 14:44, Hannu Krosing wrote: > > Hi hackers > > > > I was loading our mailing list into a database and noticed that some > > text results int substring not working. > > > > Specifically calling substring with some specific values fails > > > > badutf8=# select ctid, id, substring(body, 1, 4) from bademail; > > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2 > > LOCATION: report_invalid_encoding_int, mbutils.c:1847 > > > > Asking one byte longer substring works ok > > > > badutf8=# select ctid, id, substring(body, 1, 5) from bademail; > > ctid │ id │ substring > > ───────┼────┼─────────── > > (0,1) │ 1 │ Hi ev > > (1 row) > > > > as do other ways of getting the same 4 bytes > > > > badutf8=# select ctid, id, substring(body from '^.{4}') from bademail; > > ctid │ id │ substring > > ───────┼────┼─────────── > > (0,1) │ 1 │ Hi e > > (1 row) > > > > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail; > > ctid │ id │ substring > > ───────┼────┼─────────── > > (0,1) │ 1 │ Hi e > > (1 row) > > > > is this expected behaviour and I just have to always noirmalize when > > loading exotic UTF8 strings ? > > Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be > able to load it into the database in the first place. If it is valid, > then the substring() should work. > > > If you want to replicate this use attached python script to load data > > I could not reproduce this. That substring() query after running your > script works fine for me. > > badutf8=# select ctid, id, substring(body, 1, 4) from bademail; > ctid | id | substring > -------+----+----------- > (0,1) | 1 | Hi e > (1 row) > > Which version did you use? What is the database's encoding and what is > the client encoding? I used 'master', with UTF-8 as server and client > encoding. > > - Heikki >
