Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-20 Thread Ken Hornstein
>> - For _display_, try to convert all of the characters to the native
>>   character set (yes, using the locale, dammit!).

> OK.  Is that using POSIX, or does it require something extra?

POSIX includes iconv(), which is adequate.  If the Unicode library we
need to use has a charset conversion API that is better, we should use
that (my beef with iconv() is that you cannot specify a substitution
character, which forces some awkward handling of unconvertible
characters).

>> - Reconvert such messages to 'canonical' standard while sending.
>>   Well, I think just for addresses; leaving everything else as an
>>   encoded word might not be harmful.  But I'd have to think about it.

> The only thing I can think of is if something somewhere suggests a
> preferred format when multiple are valid, like an ASCII subject should
> be just the subject.  Kind of like how it's annoying that Android
> base64s bodies, IIRC.

AFAIK, this shouldn't be a concern; we already have a fair amount of
code that produces the 'minimal' encoding (e.g., we don't use base64
or q-p unless it's a requirement).

--Ken

___
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers


Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-20 Thread Earl Hood
On Wed, Aug 12, 2015 at 8:55 PM, Ken Hornstein wrote:

> - Handle everything internally as UTF-8.
> - For _display_, try to convert all of the characters to the native
>   character set (yes, using the locale, dammit!).
> - For things like _replies_, if we are not in an UTF-8 locale then
>   downgrade things like the addresses using RFC 6857 rules (well, the
>   subject as well ...  I think the way it would work is the format
>   engine would do the encoding for you behind the scenes for all
>   components).
> - Reconvert such messages to 'canonical' standard while sending.  Well, I
>   think just for addresses; leaving everything else as an encoded word might
>   not be harmful.  But I'd have to think about it.
> - But this also makes it clear that the thoughts of having an 'external'
>   decoder stage will simply not work; you need to know too much about each
>   header, because they're all handled differently.

Sorry for late reply...

The above looks reasonable to me.

The 'external' encoder/decoder is more of a pie-in-the-sky idea of
allowing the encoder system to be abstracted so one could plug in
different engines if needed.  Basically, using pipes into and out of it
whenever an encoding/decoding operation is required.

However, if the level of effort to achieve such an abstraction is not
worth any potential benefit, do not bother with it.

Note, there may be some benefit in providing some level of abstraction
for the encoder if there is a concern of nmh getting locked-in code-wise
to a specific library.

--ewh



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-13 Thread Ralph Corderoy
Hi Ken,

> - For _display_, try to convert all of the characters to the native
>   character set (yes, using the locale, dammit!).

OK.  Is that using POSIX, or does it require something extra?

> - Reconvert such messages to 'canonical' standard while sending.
>   Well, I think just for addresses; leaving everything else as an
>   encoded word might not be harmful.  But I'd have to think about it.

The only thing I can think of is if something somewhere suggests a
preferred format when multiple are valid, like an ASCII subject should
be just the subject.  Kind of like how it's annoying that Android
base64s bodies, IIRC.

Cheers, Ralph.



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-12 Thread Christian Neukirchen
Ralph Corderoy ra...@inputplus.co.uk writes:

> Hi Ken,
>
>>>> So ... what would that mean, exactly?  Ignore the locale setting
>>>> and always output UTF-8?
>>>
>>> Well, yes, the code would be writing UTF-8, with the knowledge of
>>> how many cells have been occupied, e.g. one for the combining `a⃞',
>>> but it could complain about the non-UTF-8 locale setting, or try and
>>> set up `fire and forget' converter on open and opening files if it
>>> was easy enough to be worth the bother.
>>
>> Help me out here, because I'm trying to translate your concepts into
>> actual code and I'm having some problems seeing how it would work.
>
> Geez, how much hand-waving do you want a guy to do?  :-)
>
>> Assuming we don't bring in a library like ICU,
>
> GNU's libunistring might be an alternative to ICU.
> http://www.gnu.org/software/libunistring/

This small lib could be useful as well; it's Expat-licensed and could
even be vendored:

https://github.com/JuliaLang/utf8proc

-- 
Christian Neukirchen  chneukirc...@gmail.com  http://chneukirchen.org




Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-12 Thread Ken Hornstein
> Take the reply command.  The first thing it needs to do is read the
> original email data to generate the draft template for editing.  The
> initial read operation is filtered thru the Encoder first.  The result
> is passed into the nmh engine to parse header fields and other jazz to
> create the draft message (all of this is done in the UTF8 world).  When
> writing the draft, the data is piped thru the encoder then written to
> disk before launching the editor (hopefully it is a no-op, but if in a
> non-UTF8 locale...).

So I was about to say that we don't know what to do in that case, but
I took a look at RFC 6857.  It turns out that it spells out exactly
how to 'downgrade' a message to only ASCII.  This requires encoding
domains in Punycode, using RFC 2047 and RFC 2231 where appropriate, and
using RFC 2047 for the addr-spec if the mailbox name contains UTF-8.

This does not strike me as terrible, and the code is mostly written
(well not to convert U-labels to A-labels, but pretty much every Unicode
string library we've looked at has a Punycode encoder-decoder).

So that suggests to me:

- Handle everything internally as UTF-8.
- For _display_, try to convert all of the characters to the native
  character set (yes, using the locale, dammit!).
- For things like _replies_, if we are not in an UTF-8 locale then downgrade
  things like the addresses using RFC 6857 rules (well, the subject as well ...
  I think the way it would work is the format engine would do the encoding
  for you behind the scenes for all components).
- Reconvert such messages to 'canonical' standard while sending.  Well, I
  think just for addresses; leaving everything else as an encoded word might
  not be harmful.  But I'd have to think about it.
- But this also makes it clear that the thoughts of having an 'external'
  decoder stage will simply not work; you need to know too much about each
  header, because they're all handled differently.

Thoughts?

--Ken



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-12 Thread Paul Fox
ken wrote:
>> That problem space lives well outside of nmh. The people to rightly fix
>> it are the xterm authors, and people writing keyboard drivers. These
>> conversion layers belong inside the terminal I/O drivers, where they
>> can fix the problem for everything.
>
> The people who at least have spoken up on this list in the past have not,
> AFAIK, lacked the technical ability to run in an UTF-8 locale; there's
> no work on xterm or terminal drivers that is necessary; that's all been
> done a long time ago.  They have just chosen not to, e.g.:
>
> http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00206.html
> http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00203.html
>
> As Earl Hood pointed out: "Character encoding choices can get quite
> political."
>
> I confess that I am surprised the "UTF-8 or die" crowd has been so
> unanimous so far.  No one dissents from this view?  Like I said, it
> simplifies a WHOLE bunch of code (at the cost of adding a new library
> dependency), so I would actually be fine with it.

i don't think the current respondents represent a very wide demographic.

paul

=--
 paul fox, p...@foxharp.boston.ma.us (arlington, ma, where it's 63.3 degrees)



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-12 Thread Ralph Corderoy
Hi Jon,

> I am in no way an expert on this.  But, I won't let that stop me.

That's the spirit!

> The reason why I think that Unicode is appropriate is that it has been
> designed to be a superset of all other character sets.  Being that the
> RFCs allow the mixing of character sets, Unicode allows them to be
> represented without having to encode bank switching.  I realize that
> doing this requires a library that does all of the Unicode character
> handling properly, which is not a trivial task.

If you skim through the Table of Contents at the start of
http://www.gnu.org/software/libunistring/manual/libunistring.html you'll
see it handles a lot of the nitty-gritty for you.  (Other libraries
suggested probably do the same, I just happen to know this one.)

Cheers, Ralph.



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-12 Thread Ralph Corderoy
Hi Ken,

>> Any chance you can add a References header?  Only the lack of it is
>> breaking my funky mkdir/tree(1)-based threading.  :-)
>
> Huh, I'm not?  I guess I was under the impression that if you didn't
> have one, you could use In-Reply-To.

I think that's true, but I didn't/haven't bothered coding for that
eventuality so I thought I'd mention it anyway.

> Hm, I thought they were the same, but I guess they're not, are they?

No.

> Looks like the References header includes previous Message-IDs.

Yes, handy when one doesn't have the whole thread, say.

> I'll work on updating my wacky replcomps, but will manually include
> one in the short term.

My wacky replgroupcomps has

%; Make References: and In-reply-to: fields for threading.
%; Use (void), (trim) and (putstr) to eat trailing whitespace.
%;
%{message-id}In-reply-to: %{message-id}\n%\
%{message-id}References: \
%{references}%(void{references})%(trim)%(putstr) %\
%(void{message-id})%(trim)%(putstr)\n%\

Cheers, Ralph.



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-12 Thread Ken Hornstein
> It appears the basic processing model is a pipeline:
>
>   Raw -> [Encoder] -> UTF8 -> [Processor] -> UTF8 -> [Encoder] -> Output

I understand where you're coming from ... but it's not that simple.

We're getting to a point where UTF-8 is going to appear in email
addresses.  That's technically allowed today under the new RFCs.  The
problem then becomes: "Okay, 'Output' in the above stage needs to be
'Input' when doing message replies.  How, exactly, do we do that?"
It's not just a matter of slapping a pipe to iconv on the end of every
command.

--Ken



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 Thread Oliver Kiddle
Ken Hornstein wrote:
> Even if it can, I am unsure we can maintain
> the correct column position when dealing with things like combining
> characters.

That is possible. wcwidth() returns 0 for combining characters.

Do we have any specific cases where forcing a UTF-8 assumption actually
helps? The POSIX API is clumsy but the fact that it deals in the current
locale rather than UTF-8 doesn't make much difference. The code needs an
API to know stuff like how wide a string is. Knowing you have a UTF-8
encoding doesn't really gain you anything.

I think it'd be better to focus on real features. So if you want, for
example, character substitution on conversion failure and libunistring
helps then configure can check for it and disable the feature if it
isn't found. As an aside, that particular feature only sounds useful if
you're actually using a non-UTF-8 locale.

Given that nmh is BSD licenced, I'd probably favour libicu over
libunistring just for its licence. Checking on a Debian system, neither
have vast numbers of reverse dependencies.

Oliver



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 Thread Ken Hornstein
> I am in no way an expert on this.  But, I won't let that stop me.

Welcome to the club!  I think we're all in the same boat in that
regards.

> It seems to me that the only solution is to use Unicode internally.
> Disgusting as it seems to those of us who are old enough to hoard
> bytes, we might want to consider using something other than UTF-8
> for the internal representation.  Using UTF-16 wouldn't be horrible
> but I recall that the Unicode folks made a botch of things so that
> one really needs 24 bits now, which really means using 32 internally.

AFAICT ... there is probably no advantage in using UTF-16 or UTF-32
versus UTF-8.

People might think that you gain something because with UTF-16 two
bytes == 1 character.  But that's only true for things in the Basic
Multilingual Plane, and people are now telling us they want to send
emoji in email, which are NOT part of the BMP, which means we have to
start dealing with things like surrogate pairs.  And really, even with
just the BMP, combining characters toss that idea out of the window.
UTF-32 lets you say 4 bytes == 1 character ... but do we care about
'characters' or 'column positions'?

So given that, I think sticking with UTF-8 is preferable; it has the
nice property that we can represent text as C strings and it's just
ASCII if you're living in a 7-bit world.

> On the output side, we just have to do the best we can if characters in
> the input locale can't be represented in the output locale.  This is
> independent of the internal representation.

Well, this works great if your locale is UTF-8.  But ... what happens
if your email address contains UTF-8, and your locale setting is
ISO-8859-1?

--Ken



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 Thread Earl Hood

On 8/11/2015 11:33 AM, Ken Hornstein wrote:


> Well, this works great if your locale is UTF-8.  But ... what happens
> if your email address contains UTF-8, and your locale setting is
> ISO-8859-1?
>
> Let me expand on this a bit, because I didn't explain it well.
> Obviously if your locale is ISO-8859-1, you probably won't have an email
> address that contains UTF-8.  But ... what if you get an email with
> a 'From' address that contains UTF-8, and you want to reply to it?
> Right now we convert stuff to the local character set when constructing
> the reply draft; we can't do that here!


Yep.  One apparent deficiency of internationalized email headers is the
inability to encode characters.  The MIME non-ASCII encoding syntax is
limited to specific contexts and not applicable for addresses.

An address encoding syntax should exist for the scenario you describe,
allowing one to encode characters that cannot be represented natively
in the current locale.  However, it seems folks no longer want to
support such environments.

I guess if nmh ever encounters the scenario, it just errors out.

--ewh



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 Thread Ken Hornstein
> That problem space lives well outside of nmh. The people to rightly fix
> it are the xterm authors, and people writing keyboard drivers. These
> conversion layers belong inside the terminal I/O drivers, where they
> can fix the problem for everything.

The people who at least have spoken up on this list in the past have not,
AFAIK, lacked the technical ability to run in an UTF-8 locale; there's
no work on xterm or terminal drivers that is necessary; that's all been
done a long time ago.  They have just chosen not to, e.g.:

http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00206.html
http://lists.nongnu.org/archive/html/nmh-workers/2012-01/msg00203.html

As Earl Hood pointed out: "Character encoding choices can get quite
political."

I confess that I am surprised the "UTF-8 or die" crowd has been so
unanimous so far.  No one dissents from this view?  Like I said, it
simplifies a WHOLE bunch of code (at the cost of adding a new library
dependency), so I would actually be fine with it.

--Ken



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 Thread Paul Vixie


Ken Hornstein wrote:
> We would be telling everyone that if they're not using UTF-8, then we
> don't support you.
>
> So what does everyone think of that?

as long as there's a way to convert my existing mail store (folders
--modernize), i'm game.

note that i also argued for dropping ultrix, sunos3, and every other
non-ansi non-posix system. so, i may be insane.

-- 
Paul Vixie



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-10 Thread Ralph Corderoy
Hi Ken,

>>> Specifically, they assume all output is in UTF-8 (because that's
>>> how Plan 9 works), but that's not a valid assumption for us.
>>
>> Aside from whether that stdio would be helpful, is it time we switch
>> to assuming UTF-8?
>
> So ... what would that mean, exactly?  Ignore the locale setting and
> always output UTF-8?

Well, yes, the code would be writing UTF-8, with the knowledge of how
many cells have been occupied, e.g. one for the combining `a⃞', but it
could complain about the non-UTF-8 locale setting, or try and set up
`fire and forget' converter on open and opening files if it was easy
enough to be worth the bother.

Cheers, Ralph.



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-10 Thread Ralph Corderoy
Hi Ken,

>> GNU's libunistring might be an alternative to ICU.
>> http://www.gnu.org/software/libunistring/
>
> Hm, I just looked at it; it's not terrible, is it?  What do people
> think about creating a dependency on this library?  I'm not sure how
> mature it is, though.

http://git.savannah.gnu.org/cgit/libunistring.git/log/README has it
going back to 2009, with some recent effort.  Bruno has been dabbling in
Unicode for a long time, http://www.haible.de/bruno/packages.html and
wrote the Unicode HOWTO, http://www.tldp.org/HOWTO/Unicode-HOWTO.html

On an Ubuntu system I've access to, package gettext depends on package
libunistring0, so it could be getting some exercise.

Cheers, Ralph.



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-10 Thread Ken Hornstein
>> - The POSIX standard functions for this, wcwidth() and wcswidth(), work
>>   on the current locale, which is not guaranteed to support UTF-8 (or
>>   even support 8-bit characters).
>
> Yes, but can't setlocale() temporarily change it to a UTF-8 locale?
>
> Granted, there's no guarantee that a UTF-8 locale exists and what it's
> called if it does exist, but maybe it would be appropriate to have a
> configure check to find one?

Well, unfortunately there's not a wonderful way to determine that (and
since nmh gets packaged up that's not a job autoconf can do; it needs
to be determined at runtime).  I suppose you could run "locale -a" and
look for everything that contains "UTF-8".  Or ... "utf8"?  Again, not a
wonderful solution (and I see some Linux systems have something like
sd_IN.utf8@devanagari, which I won't pretend to understand).

Really, the POSIX character functions treat the locale and characters
themselves as opaque blocks; if you want to do something crazy like
override the native locale or work on characters that are not part of
the native locale then you're really stepping outside of the POSIX API
box.  And I guess part of me really wonders why on earth that would be a
good idea.

--Ken



Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-10 Thread Ken Hornstein
>> So ... what would that mean, exactly?  Ignore the locale setting and
>> always output UTF-8?
>
> Well, yes, the code would be writing UTF-8, with the knowledge of how
> many cells have been occupied, e.g. one for the combining `a⃞', but it
> could complain about the non-UTF-8 locale setting, or try and set up
> `fire and forget' converter on open and opening files if it was easy
> enough to be worth the bother.

Help me out here, because I'm trying to translate your concepts into
actual code and I'm having some problems seeing how it would work.

Assuming we don't bring in a library like ICU, it's difficult for us
to reliably determine the width of a Unicode character.  Specifically:

- The POSIX standard functions for this, wcwidth() and wcswidth(), work
  on the current locale, which is not guaranteed to support UTF-8 (or
  even support 8-bit characters).

- The xlocale functions which allow one to specify a locale to functions
  like wcwidth() are not part of POSIX.

- Even if we used xlocale (or just overrode the global locale in every
  nmh program) it turns out there's not a reliable UTF-8 compatible
  default we can use; we ran into this in the test suite, some people
  just don't install all of the locales, so we can't assume en_US.UTF-8
  (or en_GB.UTF-8, or whatever).

I'm unclear how you wanted to use the iconv utility; is the idea just to
output everything in UTF-8 and run iconv as a filter for all text
output?  I think that might have unintended consequences, but putting
that aside there are other issues.  For one, iconv can't do character
substitution on conversion failure (at least the POSIX iconv cannot; I
am aware that GNU iconv can).  Even if it can, I am unsure we can
maintain the correct column position when dealing with things like
combining characters.

But hey, if I'm wrong I'd be glad to hear about it.  I think it's a much
tougher problem than people realize.

--Ken
