Mark Sapiro writes:

> > content = content.encode(decoding)
> >
> > UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in
> > position 3131: illegal multibyte sequence
> >
> > Apparently the offending attachments are specified as gb2312 (a common
> > Chinese encoding).
> >
> > Is there something I can do to somehow preprocess the archive mboxes, or
> > otherwise re-encode the attachments?
>
> Possibly there is, but this is a bug in the hyperkitty_import process.

Technically, it's a bug in common Chinese MUAs. We can work around it if we want to, of course, and I think we do.

<tl;dr endsat="Whew!">

The backstory is that Chinese (simplified, aka mainland) has three major encoding standards: GB 2312, GBK, and GB 18030. GBK is not really an encoding; it's an encoding schema which says "future Chinese encodings shall be supersets of GB 2312" but doesn't assign any new characters. GB 18030 is not only a superset of GB 2312 that actually defines the new characters compatibly with GBK, it is also a superset of Unicode that folds Unicode into the GBK code space algorithmically (GB 2312 and Unicode are incompatible in page 0).

Whew!

So, because GB 18030 is backward compatible with GB 2312, a lot of Chinese MUAs get away with incorrectly labeling the extended character set "GB 2312", and you get the error above. The same thing happens with Shift JIS, by the way.

OTOH, for that exact reason, we can do what Webencodings does: promote GB 2312 claims and *decode* with GB 18030. I think this is safe, as there's really no alternative encoding to worry about, and since this stuff is presumably all text/plain or text/html, we should be OK on security stuff (although I guess in theory it could be source code or executable scripts doing something sneaky).

(On the other hand, I *am* worried about the fact that there is a REPLACEMENT CHARACTER in the content at this point. Presumably that's because we *decoded* the original mail with errors=who-gives-a-fsck, which is not appropriate here---we can be almost sure that the content is *not* corrupt, rather it's mislabeled.)

The OP can do a poor man's version by going through the existing mbox and case-independently regexp-replacing r"=\?GB2312\?" with "=?GB18030?", and r'charset=("?)GB2312' with r'charset=\1GB18030'.

I'm still jet-lagged from PyCon, so I'm not going to do more now, and if you want some Python code to do this, please feel free to ping me on or off list.

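For the record, here is a minimal sketch of the poor man's relabeling described above, plus the "promote at decode time" idea. It is an illustration only, not what hyperkitty_import does: the function names relabel_gb2312 and decode_promoted are made up for this example, and it assumes the mbox is handled as raw bytes so that nothing gets decoded (or replaced with U+FFFD) along the way.

```python
import re

# Charset label inside RFC 2047 encoded-words, e.g. =?GB2312?Q?...?=
ENCODED_WORD = re.compile(rb"=\?GB2312\?", re.IGNORECASE)
# charset parameter of Content-Type, with an optional opening quote
CHARSET_PARAM = re.compile(rb'charset=("?)GB2312', re.IGNORECASE)


def relabel_gb2312(data: bytes) -> bytes:
    """Relabel GB2312 charset claims as GB18030 in raw mbox bytes.

    Hypothetical helper: it rewrites every textual match, including
    any that happen to occur inside message bodies.
    """
    data = ENCODED_WORD.sub(b"=?GB18030?", data)
    # \g<1> puts back the opening quote, if there was one.
    return CHARSET_PARAM.sub(rb"charset=\g<1>GB18030", data)


def decode_promoted(raw: bytes, charset: str) -> str:
    """Decode strictly, treating a GB2312 claim as GB18030.

    Hypothetical helper showing the promotion at decode time instead
    of rewriting the stored message.
    """
    if charset.lower() == "gb2312":
        charset = "gb18030"
    # errors="strict" so real corruption raises instead of becoming U+FFFD.
    return raw.decode(charset, errors="strict")
```

To use the first helper, read the mbox in binary mode, pass the bytes through relabel_gb2312, and write the result to a new file, keeping the original until the import has been checked. Base64-encoded parts are left alone, since only the charset labels change and not the payload.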
> It would help if you file an issue at
> <https://gitlab.com/mailman/hyperkitty/issues/new> with enough
> information for us to reproduce it.

    print("""
    Subject: nothing to see here: =?GB2312?Q?=FF=FD?= Oops!
    """)

should do the trick. ;-) I'll be looking for this issue, or you can assign it to me.

Steve

------------------------------------------------------
Mailman-Users mailing list Mailman-Users@python.org
https://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: https://mail.python.org/mailman/options/mailman-users/archive%40jab.org