Thanks for the clarification.
I agree that validating every input may raise runtime-cost concerns, but the cost can be kept well under control. For example, MySQL applies a finer-grained check for EUC-CN (i.e., GB2312) in https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:

```
static int func_gb2312_uni_onechar(int code) {
  if ((code >= 0x2121) && (code <= 0x2658))
    return (tab_gb2312_uni0[code - 0x2121]);
  if ((code >= 0x2721) && (code <= 0x296F))
    return (tab_gb2312_uni1[code - 0x2721]);
  if ((code >= 0x3021) && (code <= 0x777E))
    return (tab_gb2312_uni2[code - 0x3021]);
  return (0);
}
```

where `code` is obtained by subtracting 0x8080 from the two-byte EUC-CN value. Of course, MySQL's check could also be tightened. Either way, it seems reasonable to note these details in the documentation.

On Sat, May 2, 2026 at 11:28 AM David G. Johnston <[email protected]> wrote:

> On Friday, May 1, 2026, Zhongpu Chen <[email protected]> wrote:
>
>> The issue is not specific to E'\\x..' literals. A normal COPY FROM data
>> file with ENCODING 'EUC_CN' can create text rows that later cannot be
>> retrieved with SELECT.
>>
>> This suggests that input validation for EUC_CN is only structural, while
>> the EUC_CN-to-UTF8 conversion table is stricter.
>>
>
> I suspect a lack of desire to maintain and ensure that specific values are
> verified; or accepting the runtime cost to do so. It is indeed
> structural. This point should probably be documented better. But it’s
> hard to feel too bad if the input claims it is providing verifiable EUC_CN
> data then proceeds to supply data that lacks meaning in reality. We are
> happy to just store and return your data to you - but it’s unreasonable to
> ask for it to be converted. It would be nice for the database to provide
> an extra layer of protection, so I’m not against the idea. Either
> automatically or at least providing a function that could, say, be
> called in a trigger for opt-in. But definitely feels like a problematic
> benefit-to-cost proposition.
>
> David J.
>

--
Zhongpu Chen
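
P.S. To make the difference between a structural check and a range check concrete, here is a minimal sketch. It is not code from PostgreSQL or MySQL: the function names are hypothetical, the structural rule is assumed to be "both bytes lie in 0xA1..0xFE", and the range test only reuses the three intervals from the MySQL snippet above. It is also coarser than a real table lookup, since a code inside an interval can still be unassigned in the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Structural check only (assumed rule): both bytes of a two-byte
 * sequence lie in 0xA1..0xFE. */
static bool euc_cn_structurally_valid(uint8_t hi, uint8_t lo) {
    return hi >= 0xA1 && hi <= 0xFE && lo >= 0xA1 && lo <= 0xFE;
}

/* Map a structurally valid byte pair to its GB2312 code by
 * subtracting 0x8080, as described for the MySQL snippet. */
static int euc_cn_to_gb2312_code(uint8_t hi, uint8_t lo) {
    assert(euc_cn_structurally_valid(hi, lo));
    return (((int)hi << 8) | (int)lo) - 0x8080;
}

/* Range check using the three intervals from the MySQL snippet.
 * This does not consult the tables, so it cannot reject unassigned
 * codes that happen to fall inside an interval. */
static bool gb2312_code_in_table_range(int code) {
    return (code >= 0x2121 && code <= 0x2658) ||
           (code >= 0x2721 && code <= 0x296F) ||
           (code >= 0x3021 && code <= 0x777E);
}

/* Stricter check: structural validity plus the interval test. */
static bool euc_cn_range_valid(uint8_t hi, uint8_t lo) {
    return euc_cn_structurally_valid(hi, lo) &&
           gb2312_code_in_table_range(euc_cn_to_gb2312_code(hi, lo));
}
```

For example, the byte pair 0xAA 0xA1 passes the structural check but maps to code 0x2A21, which lies in the gap between the second and third intervals, so the stricter check rejects it. The extra cost per two-byte character is a subtraction and a handful of comparisons.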
