My prototype implementation is here:
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation

Usage is described in DEV.md:
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation/blob/main/DEV.md
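
To make the two modes concrete, here is a minimal Python sketch of the distinction discussed in this thread. It is an illustration only, not code from the patch: the function names are hypothetical, the structural rule (lead and trail byte both in 0xA1..0xFE) is a simplifying assumption, and Python's strict `gb2312` codec merely stands in for the GB2312-to-Unicode mapping.

```python
# Illustration only: Python codecs stand in for PostgreSQL's conversion tables.


def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Hypothetical 'native'-style check (an assumption, not PostgreSQL code):
    ASCII bytes pass; any other byte must start a two-byte sequence whose
    lead and trail bytes are both in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            i += 1
            continue
        if i + 1 >= len(data):
            return False  # truncated two-byte sequence
        trail = data[i + 1]
        if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
            return False
        i += 2
    return True


def is_read_compatible(data: bytes) -> bool:
    """Hypothetical 'read_compatible'-style check: the bytes must also be
    decodable under a strict GB2312 mapping."""
    if not is_structurally_valid_euc_cn(data):
        return False
    try:
        data.decode("gb2312")
    except UnicodeDecodeError:
        return False
    return True


sample = bytes([0xA2, 0xA3])  # the byte sequence discussed in this thread
print(is_structurally_valid_euc_cn(sample))
print(is_read_compatible(sample))
print(hex(ord(sample.decode("gb18030"))))  # code point under GB18030
```

The sample passes the structural check but fails the strict one, which is the gap between what can be written and what can be converted on read.
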

On Sat, May 9, 2026 at 4:58 PM Zhongpu Chen <[email protected]> wrote:

> > If so, tightening up the validation may break such that uses.
>
> I agree. What about introducing an extra GUC which allows users to specify
> verification logic? In fact, I have implemented this patch.
>
> ```
> SHOW encoding_validation;
> -- default behaviour
> SET encoding_validation = 'native';
> -- enforce Write to be fully compatible with Read
> SET encoding_validation = 'read_compatible';
> ```
>
> On Wed, May 6, 2026 at 8:19 PM Tatsuo Ishii <[email protected]> wrote:
>
>> > It is in general not necessarily required that all text in all
>> > non-UTF8 encodings must be convertible to UTF8.
>> >
>> > (This is also a result of history: These encodings were implemented in
>> > PostgreSQL before Unicode.)
>> >
>> > That said, I can see how different behaviors might be desirable.
>> >
>> > My first question would be, are these non-convertible byte sequences
>> > just characters that don't map to Unicode, or are they invalid within
>> > the definition of the EUC-* encodings themselves?
>>
>> A strict answer is, the former. 0xA2A3 is 3 of lowercase forms of the
>> Roman numerals (iii), which is not defined in the original GB2312
>> (the character set of EUC_CN),
>>
>> > If the latter, then
>> > we should just reject them (modulo some backward compatibility),
>> > similar to how we reject certain Unicode code points that exist
>> > "structurally" but are not valid for one reason or another.
>>
>> After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
>> super set of GB2312). In DB18030, lowercase forms of the Roman
>> numerals and other characters (e.g. Euro sign) were added.
>>
>> So I suspect that a) those characters are sometimes used with EUC_CN
>> despite the fact that they are not valid GB2312 characters. b) some
>> EUC_CN users might have already written those characters into EUC_CN
>> databases. If so, tightening up the validation may break such that
>> uses. This is just my guess. Please correct me if I am wrong.
>>
>> > Alternatively, if these byte sequences are valid characters but they
>> > just didn't end up in Unicode for some reason, then rejecting them
>> > might break valid uses.
>>
>> That's not the case, at least for 0xA2A3. It seems UCS_ti_EUC_CN.pl
>> explicitly rejects characters that are not part of GB2312, including
>> 0xA2A3, as the script is using GB18030 as a source data.
>>
>> Regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>
>
> --
> Zhongpu Chen
>

--
Zhongpu Chen
