Re: Loosening mailcap filename sanitization to allow unicode filenames
On Tue, Apr 27, 2021 at 06:16:59PM -0700, Kevin J. McCarthy wrote: Answering the question first, Mutt will decode the RFC2231 parameters and convert them to $charset. It looks like the 2231 decoder does run values through mutt_filter_unprintable(). I'm looking closer at the 2231 decoder and it actually looks like it doesn't always do this. I think that was a mistake, and I'll add a fix. -- Kevin J. McCarthy GPG Fingerprint: 8975 A9B3 3AA3 7910 385C 5308 ADEF 7684 8031 6BDA signature.asc Description: PGP signature
Re: Loosening mailcap filename sanitization to allow unicode filenames
On Wed, Apr 28, 2021 at 01:44:03AM +0200, Eike Rathke wrote: On Tuesday, 2021-04-27 10:17:29 -0700, Kevin J. McCarthy wrote: What if we added an allow_8bit parameter to the function, that also passed through bytes with the 8th bit set? I'd keep this set off in all other invocations except the mailcap invocations. Allowing *all* 8-bit may be ill advised. I'd disallow at least resulting U+0080 to U+009F Unicode control characters (C1 control codes). Also exclude the non-characters U+FFFE and U+. But, what text encoding are we actually talking about? Answering the question first, Mutt will decode the RFC2231 parameters and convert them to $charset. It looks like the 2231 decoder does run values through mutt_filter_unprintable(). mutt_filter_unprintable() calls mbrtowc() to check for a valid sequence and then iswprint() to ensure the character is in the "print" class. For the first part, I *think* that addresses your concern. But in any case, would those characters cause issues for a tempfile filename? -- Kevin J. McCarthy GPG Fingerprint: 8975 A9B3 3AA3 7910 385C 5308 ADEF 7684 8031 6BDA signature.asc Description: PGP signature
Re: Loosening mailcap filename sanitization to allow unicode filenames
On Wed, Apr 28, 2021 at 08:48:04AM +1000, Cameron Simpson wrote: On 27Apr2021 10:17, Kevin J. McCarthy wrote: First remark: I think we should make clear that this only makes sense when you're encoding filenames as UTF-8, where all multibyte sequences have a high bit set. This isn't necessarily the case with other encodings. I was under the impression other encodings at least shared the standard printable ascii 7-bit characters, and that other characters (whether single or multi-byte) had the high bit set to distinguish them. However, charsets are definitely an area I'm weak on. Second remark: As one who has long been less than enthused by sanitising filenames, what exactly are we trying to accomplish when we sanitise a filename? - avoid trickiness like whitespace and quote characters, which cause a little pain for users of the files in scripting settings? - avoiding $ and ` et al, which cause hazards for the very careless script author? (but inly if injected blindly) - avoiding other shell punctuation like redirections? same issue - avoiding escape paths such as absolute paths (/etc/passwd, oh root-run mutt user?) or ../blah to get out of the scratch area? Without qualifying these objectives, "sanitisation" means little (or too much, depending where you stand). If I had to guess, I would say all of the above. But, unfortunately the code and sanitization is 20+ years old, so I think you'd have to check with Michael or Thomas to get the exact details of why. :-) They did leave a big bold faced warning in the $mailcap_sanitize manual documentation not to turn it off though. On the one hand these are temp files, but Mutt already tries to preserve the filename to make for a nicer user interaction. It seems if we can preserve unicode filenames better we ought to do that too. "Unicode filenames" isn't a meaningful term in UNIX, as the API is C strings - byte sequences with NUL terminators. I suspect you mean "UTF-8 encoded names", which is the common modern default. Sorry for my sloppiness. I should have said to "preserve high-bit bytes in filenames." The filename parameter arrives in RFC2231 encoding, which includes a charset parameter. Mutt will decode and convert this value to the machine charset (or $charset if that's specified). -- Kevin J. McCarthy GPG Fingerprint: 8975 A9B3 3AA3 7910 385C 5308 ADEF 7684 8031 6BDA signature.asc Description: PGP signature
Re: Loosening mailcap filename sanitization to allow unicode filenames
Hi Kevin, On Tuesday, 2021-04-27 10:17:29 -0700, Kevin J. McCarthy wrote: > What if we added an allow_8bit parameter to the function, that also passed > through bytes with the 8th bit set? I'd keep this set off in all other > invocations except the mailcap invocations. Allowing *all* 8-bit may be ill advised. I'd disallow at least resulting U+0080 to U+009F Unicode control characters (C1 control codes). Also exclude the non-characters U+FFFE and U+. But, what text encoding are we actually talking about? 8bit suggests UTF-8 encoding or ISO-8859-... code pages (not assuming EBCDIC ;P). I don't see that mutt_buffer_sanitize_filename() or its calling contexts have any notion about text encoding. Assuming UTF-8 (as path names are supposed to be), check for valid sequences? Eike -- OpenPGP/GnuPG encrypted mail preferred in all private communication. GPG key 0x6A6CD5B765632D3A - 2265 D7F3 A7B0 95CC 3918 630B 6A6C D5B7 6563 2D3A Use LibreOffice! https://www.libreoffice.org/ signature.asc Description: PGP signature
Re: Loosening mailcap filename sanitization to allow unicode filenames
On 27Apr2021 10:17, Kevin J. McCarthy wrote: >Ticket 351 on gitlab (https://gitlab.com/muttmua/mutt/-/issues/351) >noted that an attachment 中文名称.txt, when launched via a mailcap >viewer, created a tempfile ".txt". Ouch. >This is because of the sanitize_filename() functions, which have an >allow-list of >"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%/" >(with the '/' disabled for filenames). > >I'd be reluctant to change sanitization for the %{} or %t >expandos, but this does seem to be a bit strict for the filename. >Oswald notes in the ticket that 8-bit characters are harmless at the >system level (Oswald, feel free to reply/clarify - I'm not trying to >put words in your mouth). First remark: I think we should make clear that this only makes sense when you're encoding filenames as UTF-8, where all multibyte sequences have a high bit set. This isn't necessarily the case with other encodings. Second remark: As one who has long been less than enthused by sanitising filenames, what exactly are we trying to accomplish when we sanitise a filename? - avoid trickiness like whitespace and quote characters, which cause a little pain for users of the files in scripting settings? - avoiding $ and ` et al, which cause hazards for the very careless script author? (but inly if injected blindly) - avoiding other shell punctuation like redirections? same issue - avoiding escape paths such as absolute paths (/etc/passwd, oh root-run mutt user?) or ../blah to get out of the scratch area? Without qualifying these objectives, "sanitisation" means little (or too much, depending where you stand). >On the one hand these are temp files, but Mutt already tries to >preserve the filename to make for a nicer user interaction. It seems >if we can preserve unicode filenames better we ought to do that too. "Unicode filenames" isn't a meaningful term in UNIX, as the API is C strings - byte sequences with NUL terminators. I suspect you mean "UTF-8 encoded names", which is the common modern default. >What if we added an allow_8bit parameter to the function, that also >passed through bytes with the 8th bit set? I'd keep this set off in >all other invocations except the mailcap invocations. Of course, the trickiness is that header things like filenames are, IIRC, "bytes". Without a charset, do we inherently know anything about them _as characters_? I'm +1 for allow_8bit if we make it clear in the docs (and implemented it correctly in the code) that this refers to the in-filesystem byte encoding of the filename. _Not_ hypothetical "Unicode". One person's "Unicode" is another's Shift-JIS :-) https://en.wikipedia.org/wiki/Mojibake One has but to shift one's shell locale to see this play out. Cheers, Cameron Simpson
Loosening mailcap filename sanitization to allow unicode filenames
Ticket 351 on gitlab (https://gitlab.com/muttmua/mutt/-/issues/351) noted that an attachment 中文名称.txt, when launched via a mailcap viewer, created a tempfile ".txt". This is because of the sanitize_filename() functions, which have an allow-list of "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%/" (with the '/' disabled for filenames). I'd be reluctant to change sanitization for the %{} or %t expandos, but this does seem to be a bit strict for the filename. Oswald notes in the ticket that 8-bit characters are harmless at the system level (Oswald, feel free to reply/clarify - I'm not trying to put words in your mouth). On the one hand these are temp files, but Mutt already tries to preserve the filename to make for a nicer user interaction. It seems if we can preserve unicode filenames better we ought to do that too. What if we added an allow_8bit parameter to the function, that also passed through bytes with the 8th bit set? I'd keep this set off in all other invocations except the mailcap invocations. Comments? -- Kevin J. McCarthy GPG Fingerprint: 8975 A9B3 3AA3 7910 385C 5308 ADEF 7684 8031 6BDA signature.asc Description: PGP signature