I spent the better part of today investigating why we appear to have problems with our patch emails when the contents are not 7-bit ASCII. I've been through the source code of git-format-patch and git-send-email, refreshed my memory of various RFCs, and performed a number of experiments.

The Details:
------------

If an email, or any MIME part of an email, does *not* specify a Content-Type with a charset parameter, then the encoding defaults to 7-bit US-ASCII.
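
As a rough illustration of that default, here is a minimal sketch using Python's standard email package (my choice for the illustration, it is not part of git). The message text and the address are made up. The parser reports no declared charset, so a conforming reader has to fall back to US-ASCII:

```python
from email import message_from_string

# A hypothetical minimal message with no Content-Type header at all.
raw = "From: dev@example.com\nSubject: [PATCH] demo\n\nbody text\n"
msg = message_from_string(raw)

# With no header present, the parser reports the default type.
content_type = msg.get_content_type()

# No charset parameter was declared, so this is None ...
declared_charset = msg.get_content_charset()

# ... and a reader following the RFCs falls back to US-ASCII.
effective_charset = declared_charset or "us-ascii"

print(content_type, declared_charset, effective_charset)
```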

That hasn't been a problem in the past because virtually all our patches have been restricted to 7-bit ASCII, so we never really noticed it. More recently, however, we have been sending files with UTF-8 encoded values and we started to see what appeared to be corruption in the patch. This was most noticeable when the mail passed through Mailman, some versions of which attempt to transcode the email to match the list preferences.

Here is what is actually going on:

git-format-patch does *not* set the charset when it formats the email. Without a charset specified, anybody handling the email according to the RFCs is supposed to treat the body as 7-bit ASCII. Thus patches which contain UTF-8 characters outside the range of 7-bit ASCII have the potential to be mangled because the content (8-bit UTF-8) does not match the content declaration (7-bit ASCII).

You can instruct git-format-patch to add arbitrary email headers. Thus we can force git-format-patch to provide the correct content declaration. This is best done by adding a format parameter to your ~/.gitconfig like this:

[format]
headers = "Content-Type: text/plain; charset=\"utf-8\"\nContent-Transfer-Encoding: 8bit\n"

When you do this the headers at the top of your formatted patch will include:

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

While that's a good start, *it's not enough*. Why?

Those extra email headers have to actually make it into the email headers being sent via SMTP. Having those header lines sit in a file which is attached to the email you're sending is not sufficient. Why? Because the attached patch becomes part of the email body, so the headers in the patch file never make it into the actual SMTP email headers.

Sending a patch as an attachment prevents the Content-Type from being inserted into the headers of the email, that's why when we send patches as an attachment they get their UTF-8 content screwed up.

So, is there a solution which causes the headers inserted by git-format-patch to become part of the actual email headers? Yes, it's called git-send-email. git-send-email is designed to send a git patch and as such it knows how to parse a patch formatted by git-format-patch. One of the things it does is look for email headers in the git-format-patch file and insert them into the SMTP headers for the email.
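
To make the idea concrete, here is a small sketch of reading the headers back out of a formatted patch. It uses Python's standard email parser and is *not* how git-send-email is implemented; the patch text, hash, and address are invented and heavily abbreviated:

```python
from email import message_from_string

# Hypothetical, abbreviated git-format-patch output; real output has more.
patch_text = (
    "From 0123456789abcdef0123456789abcdef01234567 Mon Sep 17 00:00:00 2001\n"
    "From: A Developer <dev@example.com>\n"
    "Date: Fri, 1 Jan 2010 00:00:00 +0000\n"
    "Subject: [PATCH] demo\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "commit message\n"
    "---\n"
)

# The patch file is itself shaped like an email, so it parses as one.
patch = message_from_string(patch_text)

# These are the headers a sender has to copy into the outgoing message.
wanted = ("Content-Type", "Content-Transfer-Encoding")
outgoing_headers = {n: patch[n] for n in wanted if patch[n] is not None}

print(outgoing_headers)
print(patch.get_content_charset())
```

If the headers only ever live inside an attached file, nothing performs this copying step, which is the whole problem.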

What happens when you send a patch as an attachment?

This causes the email body to become a collection of MIME parts. Each MIME part *should* have its own Content-Type declaration. Unfortunately neither "git-send-email --attach" nor Thunderbird, when attaching a patch, sets the charset parameter of the Content-Type declaration for the patch content. Remember the rules of the various RFCs: if the charset parameter is absent, the encoding of the content is to be interpreted as 7-bit ASCII. So any of our patches which contain UTF-8 can be mangled because we've violated the rules of email. We sent something which we implicitly declared was 7-bit ASCII but was actually 8-bit UTF-8!
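
The per-part declaration can be seen in a small sketch built with Python's standard email package. This only models the structure of such a mail; it is not what Thunderbird or git-send-email run, and the patch content is made up:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.nonmultipart import MIMENonMultipart
from email.mime.text import MIMEText

patch_text = "Author: Jos\u00e9\n"  # made-up patch content with one non-ASCII char

# A part like the ones described above: text/x-patch, no charset parameter.
undeclared = MIMENonMultipart("text", "x-patch")
undeclared.set_payload(patch_text)

# The same content with an explicit charset; MIMEText adds the parameter.
declared = MIMEText(patch_text, "x-patch", "utf-8")

mail = MIMEMultipart()
mail.attach(MIMEText("See attached patch.\n", "plain", "us-ascii"))
mail.attach(undeclared)
mail.attach(declared)

# Each part answers for itself; the middle one declares nothing.
charsets = [part.get_content_charset() for part in mail.get_payload()]
print(charsets)
```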

I could find no way to make Thunderbird use a specific Content-Type when sending a patch file. git-send-email with the attach option has the Content-Type hardcoded which I consider a bug. Unfortunately I couldn't find a git bug reporting tool to report this bug.

If you have git-format-patch add the Content-Type and use git-send-email to send the patch, it will be *correct*. Why? Because git-send-email will insert the Content-Type into the SMTP header, which will apply to the *entire* email body because there are *no* MIME parts as there would be if it were an attachment. If you do that, not only will the patch be sent correctly, but it will also display correctly in your email reader!

But isn't there another way to send the patch without getting its charset screwed up? Yes, you can send the patch as binary data (e.g. base64 encoded), which implies it must be an attachment.

What happens when you base64 encode the attachment? Basically it tells the mail handling components along the way "keep your grubby hands off, do not try to interpret this". That's both good and bad for us. It's good that the UTF-8 encoded patch makes it through the mail system unscathed, but mail readers have no clue how to properly display it; in fact they won't even try. That means you can't read the patch in your mail reader. You'll have to save it, which invokes the base64 decoding, and then you can open it as a file.

But wait a minute! I've seen base64 encoded patches on this email list and I can read the patch. You're lying! Well, that might be because of this email Stephen sent out a while ago.

> Stephen Gallagher wrote:
>
> The latest versions of git (including that shipped with Fedora 12)
> has some trouble parsing patch files sent through mailman that are
> encoded as "Content-Type: text/plain;"
>
> Thunderbird can be made to send all attachments in base64-encoded
> form (which should be safe for mailman) by changing the following
> settings.
>
> In Thunderbird Preferences, go to the Advanced->General tab. Select
> "Config Editor"
>
> Search for mail.file_attach_binary and set this value to true.
>
> Now all of your attachments will be base64-encoded. Yes, it increases
> the filesize somewhat, but accuracy > bandwidth.

So what's going on in this case? Let's follow the steps. You start by attaching a patch file, whose Content-Type is correctly determined to be "text/x-patch", then it's base64 encoded and sent as an attachment. On the receiving end the mime part containing the patch has these headers:

Content-Type: text/x-patch;
Content-Transfer-Encoding: base64

So the mail reader (Thunderbird in my instance) sees this was transferred as base64 and decodes it. Then it looks at the Content-Type and sees that it's text (but *without* the charset parameter). So the mail reader displays it as 7-bit ASCII. O.K., I lied a little. Thunderbird actually has a setting which specifies the default charset to use if one is not found, and it defaults to ISO-8859-1. This is somewhat in violation of the RFCs, but ISO-8859-1 has become a practical default. So in this case Thunderbird tries to display the UTF-8 encoded patch as ISO-8859-1 text. This is a *display problem only*: the actual data is correct because it was sent as base64 and thus was never mucked with, but Thunderbird is displaying it incorrectly.
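
The difference between "data intact" and "displayed wrong" can be shown with a few lines of Python. This is only a sketch of the byte handling, with a made-up sample string standing in for the patch:

```python
import base64

original = "Jos\u00e9\n"  # made-up patch content containing one non-ASCII char

wire = base64.b64encode(original.encode("utf-8"))  # what travels through the list
received = base64.b64decode(wire)                  # what the reader decodes

# The reader's fallback guess when no charset is declared.
as_latin1 = received.decode("iso-8859-1")

# What you get after manually selecting UTF-8.
as_utf8 = received.decode("utf-8")

print(repr(as_latin1))
print(repr(as_utf8))
```

The bytes in `received` are identical to what was sent; only the interpretation chosen for display differs.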

There is a manual workaround for this in Thunderbird. If you're looking at a patch which appears to be rendered with the wrong encoding (i.e. charset), go to the View --> Character Encoding menu and select UTF-8.

Let's go back a minute to Stephen's assertion that Mailman is screwing up the patches and we need to have Thunderbird base64 encode them to prevent Mailman from mucking with them. This really isn't a Mailman problem, rather it's a problem with how we're sending patches. We've lied to Mailman and all the other components which might handle the mail along the way. We told those mail systems the body of the mail was 7-bit ASCII (because we omitted the charset parameter in the mail header) and then inserted 8-bit UTF-8 into the mail body. That kind of lie won't bite you until one of the mail components decides to transcode the mail body. One of the features of Mailman is transcoding to match the list's encoding preference. The fact that Mailman corrupted the mail body is not Mailman's fault, because we lied to Mailman about the encoding of the mail body. The old saying holds true: "garbage in, garbage out".
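
Here is a toy model of what happens when a transcoding step trusts a false declaration. The function naive_transcode is purely hypothetical; it is a stand-in for "some component that transcodes" and is not Mailman's code:

```python
def naive_transcode(body: bytes, declared: str, target: str) -> bytes:
    """Hypothetical stand-in for a transcoding mail component."""
    # The component believes the declaration and decodes accordingly;
    # bytes that are invalid for the declared charset get replaced.
    text = body.decode(declared, errors="replace")
    return text.encode(target, errors="replace")


body = "Author: Jos\u00e9\n".encode("utf-8")  # made-up UTF-8 mail body

# We implied US-ASCII by omitting the charset: the non-ASCII bytes are lost.
lied = naive_transcode(body, "us-ascii", "utf-8")

# We declared UTF-8 truthfully: the body survives unchanged.
honest = naive_transcode(body, "utf-8", "utf-8")

print(lied)
print(honest)
```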

O.K., so what are our options here?

1) Continue to send patches the way we have, making sure Thunderbird is configured to base64 encode them. Accept the fact that when displayed in a mail reader any UTF-8 will be garbled and you have to manually force Thunderbird to render the patch in UTF-8. The contents of the patch remain uncorrupted; it's just a display issue in the mail reader.

2) Configure git-format-patch to add the correct headers and use git-send-email to send the patch. This is probably preferred because it's actually correct from an RFC standpoint.

Option 2 is actually pretty easy to use. My ~/.gitconfig is set up like this:

[sendemail]
        smtpserver = smtp.corp.redhat.com
        to = freeipa-devel@redhat.com
        from = John Dennis <jden...@redhat.com>
        confirm = never
[format]
        headers = "Content-Type: text/plain; charset=\"utf-8\"\nContent-Transfer-Encoding: 8bit\n"

Those defaults in my .gitconfig mean I never have to add any command line args to either git-format-patch or git-send-email. It's as easy as:

% git format-patch -1
% git send-email 0001-some-patch-file

The downside of using git-send-email is whoever is applying the patch will have to save the entire email to a file instead of an attachment, which might be slightly more awkward. But as you can see from above it's very hard, and in most cases impossible, to get a patch sent as an attachment to have the correct charset specified. This is a pretty serious shortcoming and calls into question the use of attachments in the first place.

--
John Dennis <jden...@redhat.com>