2018-07-23 5:40 GMT+08:00 Bruno Haible <br...@clisp.org>:
> Pádraig Brady wrote:
>> > This patch is correct (because the characters that you test for in c_iscntrl
>> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
>> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
>>
>> ... It might be worth mentioning this subtle point in the c_iscntrl() docs?
>> "Note this identifies all single byte control chars even in multibyte
>> encodings".
>
> Only in the multibyte encodings that are currently in use. We never know what
> kinds of features or misfeatures new multibyte encodings will come up with:
> Before GB18030 was introduced, it was a common feature of all multibyte encodings
> (including SJIS) that ASCII characters in the range 0x00..0x3F never occur as
> second or later byte in a multibyte character. Well, GB18030 broke this
> assumption.
>
> So, it is dangerous to rely on this property. Therefore I wouldn't like to
> document it in the c_iscntrl() documentation.
>
> Bruno

Hello, is there any update on this? Discussions about encodings are beyond my knowledge, but I can see that it is difficult to filter control characters correctly. How about following Pádraig Brady's idea and filtering only \n?
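
To make the two options concrete, here is a rough sketch of what I understand them to be. It is not the patch itself: the names is_ascii_cntrl, mask_cntrl_bytes and mask_newlines are hypothetical, is_ascii_cntrl is only a local stand-in for the range that c_iscntrl tests (0x00..0x1F, 0x7F), and replacing with '?' is my own assumption about what "filter" would mean here.

```c
#include <stddef.h>

/* Hypothetical stand-in for c_iscntrl(): true for the ASCII control
   bytes 0x00..0x1F and 0x7F, independent of the current locale.  */
static int
is_ascii_cntrl (unsigned char c)
{
  return c <= 0x1F || c == 0x7F;
}

/* Option 1: replace every ASCII control byte in BUF[0..LEN-1] with '?'.
   This works byte by byte, so it is only safe while no multibyte
   encoding uses 0x00..0x1F or 0x7F as a second or later byte.
   Returns the number of bytes replaced.  */
static size_t
mask_cntrl_bytes (char *buf, size_t len)
{
  size_t replaced = 0;
  for (size_t i = 0; i < len; i++)
    if (is_ascii_cntrl ((unsigned char) buf[i]))
      {
        buf[i] = '?';
        replaced++;
      }
  return replaced;
}

/* Option 2: replace only newline bytes with '?'.
   Returns the number of bytes replaced.  */
static size_t
mask_newlines (char *buf, size_t len)
{
  size_t replaced = 0;
  for (size_t i = 0; i < len; i++)
    if (buf[i] == '\n')
      {
        buf[i] = '?';
        replaced++;
      }
  return replaced;
}
```

Both variants scan bytes, so option 2 rests on the same kind of assumption as option 1, only for the single byte 0x0A. The difference is in how much gets filtered, not in how the encoding question is handled.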