On 29/05/2026 14:44, Hannu Krosing wrote:
Hi hackers
I was loading our mailing list into a database and noticed that some
text results int substring not working.
Specifically calling substring with some specific values fails
badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
LOCATION: report_invalid_encoding_int, mbutils.c:1847
Asking one byte longer substring works ok
badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi ev
(1 row)
as do other ways of getting the same 4 bytes
badutf8=# select ctid, id, substring(body from '^.{4}') from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi e
(1 row)
badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi e
(1 row)
is this expected behaviour and I just have to always noirmalize when
loading exotic UTF8 strings ?
Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be
able to load it into the database in the first place. If it is valid,
then the substring() should work.
If you want to replicate this use attached python script to load data
I could not reproduce this. That substring() query after running your
script works fine for me.
badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
ctid | id | substring
-------+----+-----------
(0,1) | 1 | Hi e
(1 row)
Which version did you use? What is the database's encoding and what is
the client encoding? I used 'master', with UTF-8 as server and client
encoding.
- Heikki