Re: [Nmh-workers] nmh architecture discussion: format engine character set
> > - For _display_, try to convert all of the characters to the native
> >   character set (yes, using the locale, dammit!).
>
> OK.  Is that using POSIX, or does it require something extra?

POSIX includes iconv(), which is adequate.  If the Unicode library we
need to use has a charset conversion API that is better, we should use
that (my beef with iconv() is that you cannot give a substitution
character, which requires some awkward handling).

> > - Reconvert such messages to 'canonical' standard while sending.
> >   Well, I think just for addresses; leaving everything else as an
> >   encoded word might not be harmful.  But I'd have to think about it.
>
> The only thing I can think of is if something somewhere suggests a
> preferred format when multiple are valid, like an ASCII subject should
> be just the subject.  Kind of like how it's annoying that Android
> base64s bodies, IIRC.

AFAIK, this shouldn't be a concern; we already have a fair amount of
code that produces the 'minimal' encoding (e.g., we don't use base64 or
q-p unless it's a requirement).

--Ken

___
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers
Re: [Nmh-workers] nmh architecture discussion: format engine character set
On Wed, Aug 12, 2015 at 8:55 PM, Ken Hornstein wrote:
> - Handle everything internally as UTF-8.
> - For _display_, try to convert all of the characters to the native
>   character set (yes, using the locale, dammit!).
> - For things like _replies_, if we are not in a UTF-8 locale then
>   downgrade things like the addresses using RFC 6857 rules (well, the
>   subject as well ... I think the way it would work is the format
>   engine would do the encoding for you behind the scenes for all
>   components).
> - Reconvert such messages to 'canonical' standard while sending.
>   Well, I think just for addresses; leaving everything else as an
>   encoded word might not be harmful.  But I'd have to think about it.
> - But this also makes it clear that the thought of having an
>   'external' decoder stage will simply not work; you need to know too
>   much about each header, because they're all handled differently.

Sorry for the late reply ... the above looks reasonable to me.

The 'external' encoder/decoder is more of a pie-in-the-sky idea of
allowing the encoder system to be abstracted so one could plug in
different engines if needed: basically, using pipes into and out of it
whenever an encoding/decoding operation is required.  However, if the
level of effort to achieve such an abstraction is not worth any
potential benefit, do not bother with it.

Note, there may be some benefit in providing some level of abstraction
for the encoder if there is a concern of nmh getting locked in
code-wise to a specific library.

--ewh
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Hi Ken,

> - For _display_, try to convert all of the characters to the native
>   character set (yes, using the locale, dammit!).

OK.  Is that using POSIX, or does it require something extra?

> - Reconvert such messages to 'canonical' standard while sending.
>   Well, I think just for addresses; leaving everything else as an
>   encoded word might not be harmful.  But I'd have to think about it.

The only thing I can think of is if something somewhere suggests a
preferred format when multiple are valid, like an ASCII subject should
be just the subject.  Kind of like how it's annoying that Android
base64s bodies, IIRC.

Cheers, Ralph.
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Ralph Corderoy <ra...@inputplus.co.uk> writes:

> Hi Ken,
>
> So ... what would that mean, exactly?  Ignore the locale setting and
> always output UTF-8?
>
> Well, yes, the code would be writing UTF-8, with the knowledge of how
> many cells have been occupied, e.g. one for the combining `a⃞', but it
> could complain about the non-UTF-8 locale setting, or try and set up
> `fire and forget' converter on open and opening files if it was easy
> enough to be worth the bother.
>
> Help me out here, because I'm trying to translate your concepts into
> actual code and I'm having some problems seeing how it would work.

Geez, how much hand-waving do you want a guy to do?  :-)

> Assuming we don't bring in a library like ICU,

GNU's libunistring might be an alternative to ICU.
http://www.gnu.org/software/libunistring/

This small lib could be useful as well; it is Expat-licensed and could
even be vendored: https://github.com/JuliaLang/utf8proc

-- Christian Neukirchen  chneukirc...@gmail.com  http://chneukirchen.org
Re: [Nmh-workers] nmh architecture discussion: format engine character set
> Take the reply command.  The first thing it needs to do is read the
> original email data to generate the draft template for editing.  The
> initial read operation is filtered thru the Encoder first.  The result
> is passed into the nmh engine to parse header fields and other jazz to
> create the draft message (all of this is done in the UTF8 world).
> When writing the draft, the data is piped thru the encoder then
> written to disk before launching the editor (hopefully it is a no-op,
> but if in a non-UTF8 locale...).

So I was about to say that we don't know what to do in that case, but I
took a look at RFC 6857.  It turns out that it spells out exactly how
to 'downgrade' a message to only ASCII.  This requires encoding domains
in Punycode, using RFC 2047 and RFC 2231 where appropriate, and using
RFC 2047 for an addr-spec if the mailbox name contains UTF-8.

This does not strike me as terrible, and the code is mostly written
(well, not the code to convert U-labels to A-labels, but pretty much
every Unicode string library we've looked at has a Punycode
encoder/decoder).  So that suggests to me:

- Handle everything internally as UTF-8.
- For _display_, try to convert all of the characters to the native
  character set (yes, using the locale, dammit!).
- For things like _replies_, if we are not in a UTF-8 locale then
  downgrade things like the addresses using RFC 6857 rules (well, the
  subject as well ... I think the way it would work is the format
  engine would do the encoding for you behind the scenes for all
  components).
- Reconvert such messages to 'canonical' standard while sending.  Well,
  I think just for addresses; leaving everything else as an encoded
  word might not be harmful.  But I'd have to think about it.
- But this also makes it clear that the thought of having an 'external'
  decoder stage will simply not work; you need to know too much about
  each header, because they're all handled differently.

Thoughts?
--Ken
Re: [Nmh-workers] nmh architecture discussion: format engine character set
ken wrote:
> > That problem space lives well outside of nmh.  The people to
> > rightly fix it are the xterm authors, and people writing keyboard
> > drivers.  These conversion layers belong inside the terminal I/O
> > drivers, where they can fix the problem for everything.
>
> The people who at least have spoken up on this list in the past have
> not, AFAIK, lacked the technical ability to run in a UTF-8 locale;
> there's no work on xterm or terminal drivers that is necessary;
> that's all been done a long time ago.  They have just chosen not to,
> e.g.:
>
> http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00206.html
> http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00203.html
>
> As Earl Hood pointed out: Character encoding choices can get quite
> political.  I confess that I am surprised the UTF-8 or die crowd has
> been so unanimous so far.  No one dissents from this view?  Like I
> said, it simplifies a WHOLE bunch of code (at the cost of adding a
> new library dependency), so I would actually be fine with it.

i don't think the current respondents represent a very wide
demographic.

paul
=--
paul fox, p...@foxharp.boston.ma.us (arlington, ma, where it's 63.3
degrees)
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Hi Jon,

> I am in no way an expert on this.  But, I won't let that stop me.

That's the spirit!

> The reason why I think that Unicode is appropriate is that it has
> been designed to be a superset of all other character sets.  Being
> that the RFCs allow the mixing of character sets, Unicode allows them
> to be represented without having to encode bank switching.  I realize
> that doing this requires a library that does all of the Unicode
> character handling properly, which is not a trivial task.

If you skim through the Table of Contents at the start of
http://www.gnu.org/software/libunistring/manual/libunistring.html
you'll see it handles a lot of the nitty-gritty for you.  (Other
libraries suggested probably do the same, I just happen to know this
one.)

Cheers, Ralph.
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Hi Ken,

> > Any chance you can add a References header?  Only the lack of it is
> > breaking my funky mkdir/tree(1)-based threading.  :-)
>
> Huh, I'm not?  I guess I was under the impression that if you didn't
> have one, you could use In-Reply-To.

I think that's true, but I didn't/haven't bothered coding for that
eventuality, so I thought I'd mention it anyway.

> Hm, I thought they were the same, but I guess they're not, are they?
> No.  Looks like the References header includes previous Message-IDs.

Yes, handy when one doesn't have the whole thread, say.

> I'll work on updating my wacky replcomps, but will manually include
> one in the short term.

My wacky replgroupcomps has

%; Make References: and In-reply-to: fields for threading.
%; Use (void), (trim) and (putstr) to eat trailing whitespace.
%;
%<{message-id}In-reply-to: %{message-id}\n%>\
%<{message-id}References: \
%<{references}%(void{references})%(trim)%(putstr) %>\
%(void{message-id})%(trim)%(putstr)\n%>\

Cheers, Ralph.
Re: [Nmh-workers] nmh architecture discussion: format engine character set
> It appears the basic processing model is a pipeline:
>
>   Raw -> [Encoder] -> UTF8 -> [Processor] -> UTF8 -> [Encoder] -> Output

I understand where you're coming from ... but it's not that simple.
We're getting to a point where UTF-8 is going to appear in email
addresses.  That's technically allowed today under the new RFCs.  The
problem then becomes: okay, 'Output' in the above stage needs to be
'Input' when doing message replies.  How, exactly, do we do that?  It's
not just a matter of slapping a pipe to iconv on the end of every
command.

--Ken
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Ken Hornstein wrote:
> Even if it can, I am unsure we can maintain the correct column
> position when dealing with things like combining characters.

That is possible.  wcwidth() returns 0 for combining characters.

Do we have any specific cases where forcing a UTF-8 assumption actually
helps?  The POSIX API is clumsy, but the fact that it deals in the
current locale rather than UTF-8 doesn't make much difference.  The
code needs an API to know stuff like how wide a string is.  Knowing you
have a UTF-8 encoding doesn't really gain you anything.

I think it'd be better to focus on real features.  So if you want, for
example, character substitution on conversion failure and libunistring
helps, then configure can check for it and disable the feature if it
isn't found.  As an aside, that particular feature only sounds useful
if you're actually using a non-UTF-8 locale.

Given that nmh is BSD-licensed, I'd probably favour libicu over
libunistring just for its licence.  Checking on a Debian system,
neither has vast numbers of reverse dependencies.

Oliver
Re: [Nmh-workers] nmh architecture discussion: format engine character set
> I am in no way an expert on this.  But, I won't let that stop me.

Welcome to the club!  I think we're all in the same boat in that
regard.

> It seems to me that the only solution is to use Unicode internally.
> Disgusting as it seems to those of us who are old enough to hoard
> bytes, we might want to consider using something other than UTF-8 for
> the internal representation.  Using UTF-16 wouldn't be horrible, but
> I recall that the Unicode folks made a botch of things so that one
> really needs 24 bits now, which really means using 32 internally.

AFAICT ... there is probably no advantage in using UTF-16 or UTF-32
versus UTF-8.  People might think that you gain something because with
UTF-16 two bytes == 1 character.  But that's only true for things in
the Basic Multilingual Plane, and people are now telling us they want
to send emoji in email, which are NOT part of the BMP, which means we
have to start dealing with things like surrogate pairs.  And really,
even with just the BMP, combining characters toss that idea out of the
window.  UTF-32 lets you say 4 bytes == 1 character ... but do we care
about 'characters' or 'column positions'?  So given that, I think
sticking with UTF-8 is preferable; it has the nice property that we
can represent text as C strings, and it's just ASCII if you're living
in a 7-bit world.

> On the output side, we just have to do the best we can if characters
> in the input locale can't be represented in the output locale.  This
> is independent of the internal representation.

Well, this works great if your locale is UTF-8.  But ... what happens
if your email address contains UTF-8, and your locale setting is
ISO-8859-1?

--Ken
Re: [Nmh-workers] nmh architecture discussion: format engine character set
On 8/11/2015 11:33 AM, Ken Hornstein wrote:
> > Well, this works great if your locale is UTF-8.  But ... what
> > happens if your email address contains UTF-8, and your locale
> > setting is ISO-8859-1?
>
> Let me expand on this a bit, because I didn't explain it well.
> Obviously if your locale is ISO-8859-1, you probably won't have an
> email address that contains UTF-8.  But ... what if you get an email
> with a 'From' address that contains UTF-8, and you want to reply to
> it?  Right now we convert stuff to the local character set when
> constructing the reply draft; we can't do that here!

Yep.  One apparent deficiency of internationalized email headers is the
inability to encode characters.  The MIME non-ASCII encoding syntax is
limited to specific contexts and not applicable for addresses.

An address encoding syntax should exist for the scenario you describe,
allowing one to encode characters that cannot be represented natively
in the current locale.  However, it seems folks no longer want to
support such environments.  I guess if nmh ever encounters the
scenario, it just errors out.

--ewh
Re: [Nmh-workers] nmh architecture discussion: format engine character set
> That problem space lives well outside of nmh.  The people to rightly
> fix it are the xterm authors, and people writing keyboard drivers.
> These conversion layers belong inside the terminal I/O drivers, where
> they can fix the problem for everything.

The people who at least have spoken up on this list in the past have
not, AFAIK, lacked the technical ability to run in a UTF-8 locale;
there's no work on xterm or terminal drivers that is necessary; that's
all been done a long time ago.  They have just chosen not to, e.g.:

http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00206.html
http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00203.html

As Earl Hood pointed out: Character encoding choices can get quite
political.  I confess that I am surprised the UTF-8 or die crowd has
been so unanimous so far.  No one dissents from this view?  Like I
said, it simplifies a WHOLE bunch of code (at the cost of adding a new
library dependency), so I would actually be fine with it.

--Ken
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Ken Hornstein wrote:
> We would be telling everyone if they're not using UTF-8, then we
> don't support you.  So what does everybody think of that?

as long as there's a way to convert my existing mail store (folders
--modernize), i'm game.

note that i also argued for dropping ultrix, sunos3, and every other
non-ansi non-posix system.  so, i may be insane.

-- Paul Vixie
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Hi Ken,

> > > Specifically, they assume all output is in UTF-8 (because that's
> > > how Plan 9 works), but that's not a valid assumption for us.
> >
> > Aside from whether that stdio would be helpful, is it time we
> > switch to assuming UTF-8?
>
> So ... what would that mean, exactly?  Ignore the locale setting and
> always output UTF-8?

Well, yes, the code would be writing UTF-8, with the knowledge of how
many cells have been occupied, e.g. one for the combining `a⃞', but it
could complain about the non-UTF-8 locale setting, or try and set up
`fire and forget' converter on open and opening files if it was easy
enough to be worth the bother.

Cheers, Ralph.
Re: [Nmh-workers] nmh architecture discussion: format engine character set
Hi Ken,

> > GNU's libunistring might be an alternative to ICU.
> > http://www.gnu.org/software/libunistring/
>
> Hm, I just looked at it; it's not terrible, is it?  What do people
> think about creating a dependency on this library?  I'm not sure how
> mature it is, though.

http://git.savannah.gnu.org/cgit/libunistring.git/log/README has it
going back to 2009, with some recent effort.  Bruno has been dabbling
in Unicode for a long time, http://www.haible.de/bruno/packages.html
and wrote the Unicode HOWTO,
http://www.tldp.org/HOWTO/Unicode-HOWTO.html

On an Ubuntu system I've access to, package gettext depends on package
libunistring0, so it could be getting some exercise.

Cheers, Ralph.
Re: [Nmh-workers] nmh architecture discussion: format engine character set
> > - The POSIX standard functions for this, wcwidth() and wcswidth(),
> >   work on the current locale, which is not guaranteed to support
> >   UTF-8 (or even support 8-bit characters).
>
> Yes, but can't setlocale() temporarily change it to a UTF-8 locale?
> Granted, there's no guarantee that a UTF-8 locale exists and what
> it's called if it does exist, but maybe it would be appropriate to
> have a configure check to find one?

Well, unfortunately there's not a wonderful way to determine that (and
since nmh gets packaged up, that's not a job autoconf can do; it needs
to be determined at runtime).  I suppose you could run locale -a and
look for everything that contains UTF-8.  Or ... utf8?  Again, not a
wonderful solution (and I see some Linux systems have something like
sd_IN.utf8@devanagari, which I won't pretend to understand).

Really, the POSIX character functions treat the locale and characters
themselves as opaque blocks; if you want to do something crazy like
override the native locale or work on characters that are not part of
the native locale, then you're really stepping outside of the POSIX
API box.  And I guess part of me really wonders why on earth that
would be a good idea.

--Ken
Re: [Nmh-workers] nmh architecture discussion: format engine character set
> > So ... what would that mean, exactly?  Ignore the locale setting
> > and always output UTF-8?
>
> Well, yes, the code would be writing UTF-8, with the knowledge of how
> many cells have been occupied, e.g. one for the combining `a⃞', but
> it could complain about the non-UTF-8 locale setting, or try and set
> up `fire and forget' converter on open and opening files if it was
> easy enough to be worth the bother.

Help me out here, because I'm trying to translate your concepts into
actual code and I'm having some problems seeing how it would work.
Assuming we don't bring in a library like ICU, it's difficult for us to
reliably determine the width of a Unicode character.  Specifically:

- The POSIX standard functions for this, wcwidth() and wcswidth(),
  work on the current locale, which is not guaranteed to support UTF-8
  (or even support 8-bit characters).
- The xlocale functions, which allow one to specify a specific locale
  to functions like wcwidth(), are not part of POSIX.
- Even if we used xlocale (or just overrode the global locale in every
  nmh program), it turns out there's not a reliable UTF-8 compatible
  default we can use; we ran into this in the test suite: some people
  just don't install all of the locales, so we can't assume
  en_US.UTF-8 (or en_GB.UTF-8, or whatever).

I'm unclear on how you wanted to use the iconv utility; is the idea
just to output everything in UTF-8 and run iconv as a filter for all
text output?  I think that might have unintended consequences, but
putting that aside, there are other issues.  For one, iconv can't do
character substitution on conversion failure (at least the POSIX iconv
cannot; I am aware that GNU iconv can).  Even if it can, I am unsure
we can maintain the correct column position when dealing with things
like combining characters.

But hey, if I'm wrong I'd be glad to hear about it.  I think it's a
much tougher problem than people realize.

--Ken