I spent the better part of today investigating why we appear to have
problems with our patch emails when the contents are not 7-bit ASCII.
I've been through the source code of git-format-patch and
git-send-email, refreshed my memory of various RFCs, and have performed
a number of experiments.
The Details:
------------
If an email, or any MIME part of an email, does *not* specify a
Content-Type with a charset parameter then the encoding defaults to
7-bit US-ASCII.
That hasn't been a problem in the past because virtually all our patches
have been restricted to 7-bit ASCII, so we never noticed it. More
recently, however, we have been sending files with UTF-8 encoded
values and we started to see what appeared to be corruption in the
patch. This was most noticeable when the mail passed through Mailman,
some versions of which attempt to transcode the email to match the list
preferences.
Here is what is actually going on:
git-format-patch does *not* set the charset when it formats the email.
Without a charset specified, anybody handling the email according to the
RFCs is supposed to treat the body as 7-bit ASCII. Thus patches which
contain UTF-8 characters outside the range of 7-bit ASCII will have the
potential to be mangled because the content (8-bit UTF-8) does not match
the content declaration (7-bit ASCII).
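The mismatch is easy to see in a few lines of Python. This is only a sketch: the patch line below is a made-up example, not taken from any real patch, and the strict decode stands in for any mail component that trusts the implicit US-ASCII declaration.

```python
# Hypothetical patch line containing one non-ASCII character.
patch_line = "+    author = 'Jürgen'\n"
body = patch_line.encode("utf-8")

# Any byte above 0x7F means the body is not 7-bit ASCII.
high_bytes = [b for b in body if b > 0x7F]
print(body.isascii())   # False
print(len(high_bytes))  # 2, because "ü" is two bytes in UTF-8

# A strict consumer that believes the US-ASCII declaration cannot decode it.
try:
    body.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)       # False
```

A check like this could be run over a patch file before sending, to find out whether the charset declaration matters for that particular patch.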
You can instruct git-format-patch to add arbitrary email headers. Thus
we can force git-format-patch to provide the correct content
declaration. This is best done by adding a format parameter to your
~/.gitconfig like this:
[format]
    headers = "Content-Type: text/plain; charset=\"utf-8\"\nContent-Transfer-Encoding: 8bit\n"
When you do this the headers at the top of your formatted patch will
include:
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
While that's a good start, *it's not enough*. Why?
Those extra email headers have to actually make it into the email
headers being sent via SMTP. Just because those header lines are sitting
in a file which might be attached to the email you're sending is not
sufficient. Why? Because an attached patch becomes part of the
email body, so the headers in the patch file never make it into the actual
SMTP email headers.
Sending a patch as an attachment prevents the Content-Type from being
inserted into the headers of the email. That's why our UTF-8 content gets
mangled when we send patches as attachments.
So, is there a solution which causes the headers inserted by
git-format-patch to become part of the actual email headers? Yes, it's
called git-send-email. git-send-email is designed to send a git patch
and as such it knows how to parse a patch formatted by git-format-patch.
One of the things it does is look for email headers in the
git-format-patch file and insert them into the SMTP headers for the email.
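To illustrate why this works, here is a rough sketch using Python's standard email parser rather than git-send-email itself (which is a Perl script and does its own parsing). The patch text is a hypothetical, heavily abbreviated stand-in for git-format-patch output, and the address is made up. The point is only that once the Content-Type is a top-level header of a single-part message, the charset applies to the whole body.

```python
from email import message_from_string

# Hypothetical, abbreviated format-patch output with the extra headers added.
patch_text = (
    "From: A Developer <dev@example.com>\n"
    "Date: Fri, 1 Jan 2010 00:00:00 +0000\n"
    "Subject: [PATCH] Fix name handling\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "---\n"
    " names.py | 2 +-\n"
)

msg = message_from_string(patch_text)

# No MIME parts: the top-level headers describe the entire body.
print(msg.is_multipart())                # False
print(msg.get_content_charset())         # utf-8
print(msg["Content-Transfer-Encoding"])  # 8bit
```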
What happens when you send a patch as an attachment?
This causes the email body to become a collection of MIME multiparts.
Each MIME part *should* have its own Content-Type declaration.
Unfortunately neither "git-send-email --attach" nor Thunderbird sets
the charset parameter of the Content-Type declaration for the patch
content when attaching a patch. Remember the rules of the various
RFCs: if the charset parameter is absent, the encoding of the content is
to be interpreted as 7-bit ASCII. So any of our patches which contain
UTF-8 can be mangled because we've violated the rules of email. We sent
something which we implicitly declared was 7-bit ASCII but was actually
8-bit UTF-8!
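The attachment case can be sketched the same way with Python's email parser. The raw message below is a hand-written approximation of a patch sent as an attachment, not captured output from Thunderbird or git-send-email; the boundary, address, and file name are invented for illustration.

```python
from email import message_from_string

# Hand-written approximation of a patch sent as an attachment.
raw = (
    "From: dev@example.com\n"
    "Subject: [PATCH] example\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Patch attached.\n"
    "--BOUNDARY\n"
    'Content-Type: text/x-patch; name="0001-example.patch"\n'
    'Content-Disposition: attachment; filename="0001-example.patch"\n'
    "\n"
    "+name = 'x'\n"
    "--BOUNDARY--\n"
)

msg = message_from_string(raw)
parts = msg.get_payload()
attachment = parts[1]

print(attachment.get_content_type())     # text/x-patch
print(attachment.get_content_charset())  # None, no charset was declared

# With no charset parameter, a conforming reader falls back to US-ASCII.
effective = attachment.get_content_charset() or "us-ascii"
print(effective)                         # us-ascii
```

Note that the charset on the first part does not help: each MIME part carries its own declaration, so the patch part is on its own.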
I could find no way to make Thunderbird use a specific Content-Type when
sending a patch file. git-send-email with the attach option has the
Content-Type hardcoded, which I consider a bug. Unfortunately I couldn't
find a git bug reporting tool to report this bug.
If you have git-format-patch add the Content-Type and use git-send-email
to send the patch it will be *correct*. Why? Because git-send-email will
insert the Content-Type into the SMTP header which will apply to the
*entire* email body because there are *no* MIME multiparts as there
would be if it were an attachment. If you do that not only will the
patch be sent correctly, but it will also display correctly in your
email reader!
But isn't there another way to send the patch without it getting its
charset screwed up? Yes, you can send the patch as binary data (e.g.
base64 encoded), which implies it must be an attachment.
What happens when you base64 encode the attachment? Basically it tells
the mail handling components along the way "keep your grubby hands
off, do not try to interpret this". That's both good and bad for us.
It's good that the UTF-8 encoded patch makes it through the mail system
unscathed, but mail readers have no clue how to properly display it; in
fact they won't even try. That means you can't read the patch in your
mail reader. You'll have to save it, which invokes the base64 decoding,
and then open it as a file.
But wait a minute! I've seen base64 encoded patches on this email list
and I can read the patch. You're lying! Well, that might be because of
this email Stephen sent out a while ago.
> Stephen Gallagher wrote:
>
> The latest versions of git (including that shipped with Fedora 12)
> has some trouble parsing patch files sent through mailman that are
> encoded as "Content-Type: text/plain;"
>
> Thunderbird can be made to send all attachments in base64-encoded
> form (which should be safe for mailman) by changing the following
> settings.
>
> In Thunderbird Preferences, go to the Advanced->General tab. Select
> "Config Editor"
>
> Search for mail.file_attach_binary and set this value to true.
>
> Now all of your attachments will be base64-encoded. Yes, it increases
> the filesize somewhat, but accuracy > bandwidth.
So what's going on in this case? Let's follow the steps. You start by
attaching a patch file, whose Content-Type is correctly determined to be
"text/x-patch", then it's base64 encoded and sent as an attachment. On
the receiving end the mime part containing the patch has these headers:
Content-Type: text/x-patch;
Content-Transfer-Encoding: base64
So the mail reader (Thunderbird in my instance) sees this was
transferred as base64 and decodes it. Then it looks at the Content-Type
and sees that it's text (but *without* the charset parameter). So the
mail reader displays it as 7-bit ASCII. O.K., I lied a little.
Thunderbird actually has a configuration setting which specifies the default
charset if one is not found, and it defaults to ISO-8859-1. This is
somewhat in violation of the RFCs, but ISO-8859-1 has become a common
default in practice. So in this case Thunderbird tries to display the
UTF-8 encoded patch as ISO-8859-1 text. This is a *display problem
only*; the actual data is correct because it was sent as base64 and thus
was never mucked with, but Thunderbird is displaying it incorrectly.
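Both halves of that behaviour can be reproduced in a short Python sketch. The sample string is made up, and the ISO-8859-1 decode only imitates what a mail reader with that fallback charset would do; it is not Thunderbird's code.

```python
import base64

# Hypothetical patch content with one non-ASCII character.
original = "Jürgen".encode("utf-8")

# base64 turns the 8-bit data into pure ASCII for the trip through the mail system.
wire = base64.b64encode(original)
received = base64.b64decode(wire)
print(wire.isascii())                 # True
print(received == original)           # True, the bytes arrive intact

# Displaying those bytes with the wrong fallback charset garbles them on screen.
print(received.decode("iso-8859-1"))  # JÃ¼rgen
print(received.decode("utf-8"))       # Jürgen
```

The bytes are identical in both of the last two lines; only the interpretation differs, which is why saving the attachment to a file still gives a correct patch.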
There is a manual workaround for this in Thunderbird. If you're looking
at a patch which looks like it's rendered with the wrong encoding (i.e.
charset), go to the View --> Character Encoding menu and select UTF-8.
Let's go back a minute to Stephen's assertion that Mailman is screwing
up the patches and we need to have Thunderbird base64 encode them to
prevent Mailman from mucking with them. This really isn't a Mailman
problem; rather it's a problem with how we're sending patches. We've
lied to Mailman and all the other components which might handle the mail
along the way. We told those mail systems the body of the mail was 7-bit
ASCII (because we omitted the charset parameter in the mail header) and
then inserted 8-bit UTF-8 into the mail body. That kind of a lie won't
bite you until one of the mail components decides to transcode the mail
body. One of the features of Mailman is transcoding to match the list's
encoding preference. The fact that Mailman corrupted the mail body is not
Mailman's fault because we lied to Mailman about the encoding of the
mail body. The old saying holds true: "garbage in, garbage out".
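Here is a toy model of such a transcoding step. To be clear, the transcode function is my own illustration and is not Mailman's actual code; it simply decodes with whatever charset was declared and re-encodes to a target charset, replacing anything it cannot interpret.

```python
def transcode(body: bytes, declared: str, target: str) -> bytes:
    """Decode with the declared charset, then re-encode to the target charset.

    Bytes that are invalid in the declared charset are replaced, which is
    where the damage happens when the declaration is wrong.
    """
    text = body.decode(declared, errors="replace")
    return text.encode(target, errors="replace")

# Hypothetical body: UTF-8 content containing one non-ASCII character.
body = "naïve".encode("utf-8")

# We "lied": the body is UTF-8 but was implicitly declared as US-ASCII.
mangled = transcode(body, "us-ascii", "utf-8")
print(mangled == body)  # False, the two bytes of "ï" are lost for good

# With an honest declaration the same step is harmless.
honest = transcode(body, "utf-8", "utf-8")
print(honest == body)   # True
```

Once the replacement has happened the original bytes cannot be recovered on the receiving end, which is why the patch no longer applies.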
O.K., so what are our options here?
1) Continue to send patches the way we have, making sure Thunderbird is
configured to base64 encode them. Accept the fact that when displayed in
a mail reader any UTF-8 will be garbled and you have to manually force
Thunderbird to render the patch in UTF-8. The contents of the patch
remain uncorrupted; it's just a display issue in the mail reader.
2) Configure git-format-patch to add the correct headers and send the
patch with git-send-email. This is probably preferred because it's
actually correct from an RFC standpoint.
Option 2 is actually pretty easy to use. My ~/.gitconfig is set up like
this:
[sendemail]
    smtpserver = smtp.corp.redhat.com
    to = freeipa-devel@redhat.com
    from = John Dennis <jden...@redhat.com>
    confirm = never
[format]
    headers = "Content-Type: text/plain; charset=\"utf-8\"\nContent-Transfer-Encoding: 8bit\n"
Those defaults in my .gitconfig mean I never have to add any command
line args to either git-format-patch or git-send-email. It's as easy as:
% git format-patch -1
% git send-email 0001-some-patch-file
The downside of using git-send-email is that whoever is applying the patch
will have to save the entire email to a file instead of saving an attachment,
which might be slightly more awkward. But as you can see from above, it's
very hard, and in most cases impossible, to get a patch sent as an
attachment to have the correct charset specified. This is a pretty
serious shortcoming and calls into question the use of attachments in
the first place.
--
John Dennis <jden...@redhat.com>