Following on from the "Send international text with mail(1)" thread...
There is some interest in making mail(1) add relevant MIME headers to allow:

* Correctly sending UTF-8 email.
* Identifying 7-bit ASCII emails with an appropriate content type.

... and possibly other things in the future.

However:

* Adding UTF-8 parsing directly in mail(1) and hard-coding its behaviour
  is inflexible.
* Exactly what should be sent with UTF-8 headers rather than none or
  us-ascii is partly down to personal preference:

  1. Send everything as UTF-8, including plain ASCII.
  2. Send nothing as UTF-8.
  3. Send valid UTF-8 streams as UTF-8, everything else ASCII.
  4. Send valid and sort-of-valid-but-not-really UTF-8 streams as UTF-8,
     and plain ASCII as us-ascii.

  ... etc, etc.

Different sites might reasonably have different requirements. Plus, we
don't really want to break mail(1) for anybody.

As can be seen from the size of the previous thread, finding a universal
solution that suits everybody has not yet been possible.

In an attempt to solve this, I've produced a proof of concept patch for
mail(1) to allow it to call a fixed external program, passing the mail to
it on standard input for analysis and receiving back a flag to indicate
which set of MIME headers should be included.

This is only a POC at this stage, so there may be bugs and room for
improvement. But it seems to work.

Advantages of this approach:

* Very minimal changes to mail(1).
* Flexible.
* No change for people who don't want this functionality.
  - If the external validator program is not installed, mail(1) does not
    add any new headers at all.

Cheat sheet:

1. Apply the patch, re-compile and re-install mail(1).
2. Compile the new program and put it in /bin/validate_utf8 .
3. Send mail with mail(1) and observe the headers.

For ease of testing by users who don't use or care about UTF-8, the demo
validator simply looks for an 'X' character in the mail body. If it finds
one, it treats the mail as UTF-8; everything else is treated as us-ascii.
A real UTF-8 validator would return 2 for a valid UTF-8 stream, 1 for
ASCII, and 0 for non-conformant data that we don't want to mess with
(e.g. a legacy 8-bit encoding).

Have fun!

--- collect.c.dist	Fri Jan 17 15:42:30 2014
+++ collect.c	Sun Sep 24 19:09:04 2023
@@ -39,6 +39,7 @@
 
 #include "rcv.h"
 #include "extern.h"
+#include <sys/wait.h>
 
 /*
  * Read a message from standard output and return a read file to it
@@ -62,6 +63,12 @@
 	char getsub;
 	char linebuf[LINESIZE], tempname[PATHSIZE], *cp;
 
+	int val_status;
+	sigset_t old_sigmask;
+	sigset_t temp_sigmask;
+	pid_t pid;
+	#define VALIDATOR "/bin/validate_utf8"
+
 	collf = NULL;
 	eofcount = 0;
 	hadintr = 0;
@@ -374,7 +381,79 @@
 		(void)Fclose(collf);
 		collf = NULL;
 	}
+
 out:
+
+/*
+ * Pass the content of the collected file to stdin of a forked
+ * validator program, and use its exit status to set a flag
+ * in the struct header that we can later use to include an
+ * appropriate content type header.
+ */
+
+if (collf == NULL)
+	goto done;
+rewind(collf);
+
+/*
+ * If this fork fails, it's not a critical error.  We just don't
+ * perform any UTF-8 validation in that case.
+ */
+
+pid=fork();
+if (pid==-1) {
+	goto done;
+	}
+
+if (pid==0) {
+	sigset_t val_sigs;
+	sigemptyset(&val_sigs);
+	sigaddset(&val_sigs, SIGHUP);
+	prepare_child(&val_sigs, fileno(collf), -1);
+	execl(VALIDATOR, VALIDATOR, NULL);
+	/*
+	 * If the validator doesn't exist or isn't executable then
+	 * the following exit value will be passed to the parent
+	 * below.  Therefore, it must _not_ conflict with an
+	 * expected exit value from the validator.
+	 */
+	_exit(127);
+	}
+
+/*
+ * To wait on the forked validator and get its exit status we need
+ * to block SIGCHLD while we wait.
+ */
+
+sigemptyset(&temp_sigmask);
+sigaddset(&temp_sigmask, SIGCHLD);
+sigprocmask(SIG_BLOCK, &temp_sigmask, &old_sigmask);
+if (waitpid(pid, &val_status, 0) != -1) {
+	if (WIFEXITED(val_status)) {
+		/*
+		 * Only permit _specific_ values.
+		 */
+		if (WEXITSTATUS(val_status) != 127)
+			fprintf(stderr, "Validator %s returned status %d\n",
+			    VALIDATOR, WEXITSTATUS(val_status));
+		if (WEXITSTATUS(val_status)==1)
+			hp->enc_flag=1;
+		if (WEXITSTATUS(val_status)==2)
+			hp->enc_flag=2;
+		}
+	}
+
+/*
+ * Restore previous signal mask now that we are done with the validator.
+ */
+
+sigprocmask(SIG_SETMASK, &old_sigmask, NULL);
+
+/*
+ * All done!
+ */
+
+done:
 	if (collf != NULL)
 		rewind(collf);
 	noreset--;
--- def.h.dist	Fri Jan 28 03:18:41 2022
+++ def.h	Sun Sep 24 16:01:03 2023
@@ -176,6 +176,7 @@
 	struct name *h_cc;		/* Carbon copies string */
 	struct name *h_bcc;		/* Blind carbon copies */
 	struct name *h_smopts;		/* Sendmail options */
+	unsigned int enc_flag;		/* Flag set by external UTF-8 validator */
 };
 
 /*
--- send.c.dist	Wed Mar  8 01:43:11 2023
+++ send.c	Sun Sep 24 19:00:37 2023
@@ -309,6 +309,7 @@
 	head.h_cc = NULL;
 	head.h_bcc = NULL;
 	head.h_smopts = NULL;
+	head.enc_flag = 0;
 	mail1(&head, 0);
 	return(0);
 }
@@ -529,6 +530,18 @@
 		fmt("Cc:", hp->h_cc, fo, w&GCOMMA), gotcha++;
 	if (hp->h_bcc != NULL && w & GBCC)
 		fmt("Bcc:", hp->h_bcc, fo, w&GCOMMA), gotcha++;
+	if (hp->enc_flag == 1) {
+		fprintf(fo, "MIME-Version: 1.0\n"
+		    "Content-Type: text/plain; charset=us-ascii\n"
+		    "Content-Transfer-Encoding: 7bit\n");
+		gotcha++;
+	}
+	if (hp->enc_flag == 2) {
+		fprintf(fo, "MIME-Version: 1.0\n"
+		    "Content-Type: text/plain; charset=utf-8\n"
+		    "Content-Transfer-Encoding: 8bit\n");
+		gotcha++;
+	}
 	if (gotcha && w & GNL)
 		(void)putc('\n', fo);
 	return(0);

#include <stdio.h>

int
main(void)
{
	/*
	 * Do the UTF-8 parsing of your choice here.
	 *
	 * This demo code just treats anything with an X as UTF-8,
	 * everything else as ASCII.
	 *
	 * Input is on stdin.
	 *
	 * Return 0 for no additional headers.
	 * Return 1 for us-ascii headers.
	 * Return 2 for utf-8 headers.
	 * Other return values are undefined but will currently behave
	 * like 0 (no additional headers).
	 */
	int i;

	while ((i=getc(stdin)) != EOF) {
		if (i == 'X') {
			return (2);
		}
	}
	return (1);
}
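
For anyone wanting to go beyond the 'X' demo, here is a rough sketch of what a stricter validator could look like, following the 2 / 1 / 0 convention described above. It is only one possible policy (roughly option 3 in the list of preferences), not part of the patch. The names ENC_NONE, ENC_ASCII, ENC_UTF8, u8_state, u8_feed, u8_result, classify_bytes and classify_stream are all made up for this example. It assumes that pure 7-bit input should be reported as ASCII, that well-formed UTF-8 with at least one multi-byte sequence should be reported as UTF-8, and that anything else (overlong forms, surrogates, truncated sequences, stray 8-bit bytes) should be left alone.

```c
#include <stddef.h>
#include <stdio.h>

/* Exit codes as described in the post; the enum names are hypothetical. */
enum { ENC_NONE = 0, ENC_ASCII = 1, ENC_UTF8 = 2 };

struct u8_state {
	int need;		/* continuation bytes still expected */
	unsigned char lo, hi;	/* allowed range for the next continuation byte */
	int multibyte;		/* a multi-byte sequence has been started */
	int bad;		/* a malformed sequence has been seen */
};

/* Feed one byte into the state machine. */
void
u8_feed(struct u8_state *s, unsigned char c)
{
	if (s->bad)
		return;
	if (s->need > 0) {
		if (c < s->lo || c > s->hi)
			s->bad = 1;
		s->lo = 0x80;	/* later continuation bytes use the full range */
		s->hi = 0xBF;
		s->need--;
		return;
	}
	if (c < 0x80)
		return;		/* plain ASCII */
	s->lo = 0x80;
	s->hi = 0xBF;
	if (c >= 0xC2 && c <= 0xDF) {
		s->need = 1;
	} else if (c >= 0xE0 && c <= 0xEF) {
		s->need = 2;
		if (c == 0xE0)
			s->lo = 0xA0;	/* reject overlong 3-byte forms */
		if (c == 0xED)
			s->hi = 0x9F;	/* reject UTF-16 surrogates */
	} else if (c >= 0xF0 && c <= 0xF4) {
		s->need = 3;
		if (c == 0xF0)
			s->lo = 0x90;	/* reject overlong 4-byte forms */
		if (c == 0xF4)
			s->hi = 0x8F;	/* reject values above U+10FFFF */
	} else {
		s->bad = 1;	/* stray continuation or invalid lead byte */
		return;
	}
	s->multibyte = 1;
}

/* Turn the final state into one of the three exit codes. */
int
u8_result(const struct u8_state *s)
{
	if (s->bad || s->need > 0)	/* malformed or truncated */
		return ENC_NONE;
	return s->multibyte ? ENC_UTF8 : ENC_ASCII;
}

/* Classify an in-memory buffer. */
int
classify_bytes(const void *buf, size_t len)
{
	struct u8_state s = { .need = 0, .lo = 0x80, .hi = 0xBF };
	const unsigned char *p = buf;

	for (size_t i = 0; i < len; i++)
		u8_feed(&s, p[i]);
	return u8_result(&s);
}

/* Classify everything readable from fp, e.g. stdin. */
int
classify_stream(FILE *fp)
{
	struct u8_state s = { .need = 0, .lo = 0x80, .hi = 0xBF };
	int c;

	while ((c = getc(fp)) != EOF)
		u8_feed(&s, (unsigned char)c);
	return u8_result(&s);
}
```

To use this in place of the demo, a main() would only need to `return classify_stream(stdin);`. Sites preferring a different policy (for example, sending plain ASCII as UTF-8 too) would only have to change u8_result().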