>- Message should be stored in their original forms. I.e. The
> character encoding transformation should only be done for
> display/access purposes.

Completely, 100% agree here.

>- I think using a character encoding library is unavoidable. Is iconv()
> sufficient?. If UTF-8 is to be used as the normalized encoding
> format, a library is needed that can transform the various encodings
> into it, and likely from it. Maybe it is not as big an issue as it
> was in the past, but not everyone was sold on Unicode. In my
> mail-related project, I had users that preferred they local character
> encoding formats over anything Unicode related.

Weeeeel .... not exactly.  It's not just a transformation issue; if it
were, iconv() would be fine.  The issue in the format engine is that we
need to know things about characters, like: is ' ' a space?  (The format
engine does space compression.)  If the strings are UTF-8, we can't use
isspace() on them.  We can't even use iswspace(), because that requires
the locale to be set to a UTF-8 locale.  So we need a library that can
process UTF-8, regardless of the locale setting.

> Character encoding choices can get quite political.
>
> If a library is adopted, then users have full control of what encoding
> they prefer.

Well, I was thinking that the locale would control the display/encoding
character set, like it does now.

>- As for parsing message headers, make it a configurable option
> on what the default character encoding should be. UTF-8 could be the
> default (which is fortunately is US-ASCII compatible).
>
> Real-world note: I have encountered emails that actually use a
> non-ASCII default encoding for message header data. Messages in
> non-English locale. Technically, these message are not conformant to
> the RFCs, but such messages actually exist. Hence, in my project, I
> support an option that specifies what the default encoding is.

While I understand where you're coming from, back before EAI those
messages were invalid according to the RFCs.  Now the RFCs have changed
and those messages are defined as being UTF-8, full stop, no exceptions.
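
To make the format-engine point above concrete (deciding whether a
character is a space without consulting the locale), here is a minimal
sketch in C.  It is only an illustration under stated assumptions, not
nmh code: the names utf8_decode(), is_unicode_space() and
compress_spaces() are hypothetical, the whitespace table is the Unicode
White_Space code point list, and malformed bytes are simply copied
through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most len
 * bytes) into *cp.  Returns the number of bytes consumed (1-4), or 0 if
 * the bytes are not well-formed UTF-8.  Never consults the C locale. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs N bytes. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; }
    else return 0;              /* stray continuation or invalid lead byte */

    if (len < need)
        return 0;               /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (c < min_cp[need] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;
    *cp = c;
    return need;
}

/* Hypothetical helper: code points with the Unicode White_Space property. */
static int is_unicode_space(uint32_t cp)
{
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
           cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

/* Copy in to out, replacing each run of whitespace with one ASCII space.
 * Malformed bytes are copied through one at a time.  Returns the output
 * length, or (size_t)-1 if out (of size outsz) is too small. */
static size_t compress_spaces(const char *in, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t len, i = 0, o = 0;
    int in_run = 0;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    while (i < len) {
        uint32_t cp = 0;
        size_t n = utf8_decode(s + i, len - i, &cp);
        int space = (n > 0) && is_unicode_space(cp);

        if (n == 0)
            n = 1;              /* malformed: pass one byte through */
        if (space) {
            if (!in_run) {
                if (o + 1 >= outsz)
                    return (size_t)-1;
                out[o++] = ' ';
            }
            in_run = 1;
        } else {
            if (o + n >= outsz)
                return (size_t)-1;
            memcpy(out + o, s + i, n);
            o += n;
            in_run = 0;
        }
        i += n;
    }
    if (o >= outsz)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

Whether U+00A0 (no-break space) should be compressed at all is a design
choice this sketch makes arbitrarily; a real implementation would more
likely take the classification from an existing Unicode library than
from a hand-maintained table.
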
I understand the need to define a default character set for messages
which don't meet the RFCs, but it feels wrong to me to allow the user to
override the interpretation of a message which is now legal.  I welcome
discussion in this area.

>- I think it is perfectly reasonable to leverage the current locale
> setting to determine defaults, but one should be able to explicit
> override such defaults via .mh_profile and command-line options.

Well, a user can already override that by changing locale environment
variables.  To me that seems like the right mechanism; you can do that
on the command line, with shell wrappers, whatever.

>Warning message(s) should be generated when character data is lost due
>to conversion.

It's unclear to me where those messages should go, and it doesn't seem
like anyone else does that.

--Ken

_______________________________________________
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers