Re: Loosening mailcap filename sanitization to allow unicode filenames

2021-04-27 Thread Kevin J. McCarthy

On Tue, Apr 27, 2021 at 06:16:59PM -0700, Kevin J. McCarthy wrote:
Answering the question first, Mutt will decode the RFC2231 parameters 
and convert them to $charset.  It looks like the 2231 decoder does run 
values through mutt_filter_unprintable().


I'm looking closer at the 2231 decoder and it actually looks like it
doesn't always do this.  I think that was a mistake, and I'll add a fix.

--
Kevin J. McCarthy
GPG Fingerprint: 8975 A9B3 3AA3 7910 385C  5308 ADEF 7684 8031 6BDA


signature.asc
Description: PGP signature


Re: Loosening mailcap filename sanitization to allow unicode filenames

2021-04-27 Thread Kevin J. McCarthy

On Wed, Apr 28, 2021 at 01:44:03AM +0200, Eike Rathke wrote:

On Tuesday, 2021-04-27 10:17:29 -0700, Kevin J. McCarthy wrote:


What if we added an allow_8bit parameter to the function, that also passed
through bytes with the 8th bit set?  I'd keep this set off in all other
invocations except the mailcap invocations.


Allowing *all* 8-bit may be ill advised. I'd disallow at least resulting
U+0080 to U+009F Unicode control characters (C1 control codes). Also
exclude the non-characters U+FFFE and U+. But, what text encoding
are we actually talking about?


Answering the question first, Mutt will decode the RFC2231 
parameters and convert them to $charset.  It looks like the 2231 decoder 
does run values through mutt_filter_unprintable().


mutt_filter_unprintable() calls mbrtowc() to check for a valid sequence 
and then iswprint() to ensure the character is in the "print" class.


For the first part, I *think* that addresses your concern.  But in any 
case, would those characters cause issues for a tempfile filename?


--
Kevin J. McCarthy
GPG Fingerprint: 8975 A9B3 3AA3 7910 385C  5308 ADEF 7684 8031 6BDA


signature.asc
Description: PGP signature


Re: Loosening mailcap filename sanitization to allow unicode filenames

2021-04-27 Thread Kevin J. McCarthy

On Wed, Apr 28, 2021 at 08:48:04AM +1000, Cameron Simpson wrote:

On 27Apr2021 10:17, Kevin J. McCarthy  wrote:
First remark:

I think we should make clear that this only makes sense when you're
encoding filenames as UTF-8, where all multibyte sequences have a high
bit set. This isn't necessarily the case with other encodings.


I was under the impression other encodings at least shared the standard 
printable ascii 7-bit characters, and that other characters (whether 
single or multi-byte) had the high bit set to distinguish them.


However, charsets are definitely an area I'm weak on.


Second remark:

As one who has long been less than enthused by sanitising filenames,
what exactly are we trying to accomplish when we sanitise a filename?

- avoid trickiness like whitespace and quote characters, which cause a
 little pain for users of the files in scripting settings?

- avoiding $ and ` et al, which cause hazards for the very careless
 script author? (but inly if injected blindly)

- avoiding other shell punctuation like redirections? same issue

- avoiding escape paths such as absolute paths (/etc/passwd, oh root-run
 mutt user?) or ../blah to get out of the scratch area?

Without qualifying these objectives, "sanitisation" means little (or too
much, depending where you stand).


If I had to guess, I would say all of the above.

But, unfortunately the code and sanitization is 20+ years old, so I 
think you'd have to check with Michael or Thomas to get the exact 
details of why. :-) They did leave a big bold faced warning in the 
$mailcap_sanitize manual documentation not to turn it off though.



On the one hand these are temp files, but Mutt already tries to
preserve the filename to make for a nicer user interaction.  It seems
if we can preserve unicode filenames better we ought to do that too.


"Unicode filenames" isn't a meaningful term in UNIX, as the API is C
strings - byte sequences with NUL terminators. I suspect you mean "UTF-8
encoded names", which is the common modern default.


Sorry for my sloppiness.  I should have said to "preserve high-bit bytes 
in filenames."  The filename parameter arrives in RFC2231 encoding, 
which includes a charset parameter.  Mutt will decode and convert this 
value to the machine charset (or $charset if that's specified).


--
Kevin J. McCarthy
GPG Fingerprint: 8975 A9B3 3AA3 7910 385C  5308 ADEF 7684 8031 6BDA


signature.asc
Description: PGP signature


Re: Loosening mailcap filename sanitization to allow unicode filenames

2021-04-27 Thread Eike Rathke
Hi Kevin,

On Tuesday, 2021-04-27 10:17:29 -0700, Kevin J. McCarthy wrote:

> What if we added an allow_8bit parameter to the function, that also passed
> through bytes with the 8th bit set?  I'd keep this set off in all other
> invocations except the mailcap invocations.

Allowing *all* 8-bit may be ill advised. I'd disallow at least resulting
U+0080 to U+009F Unicode control characters (C1 control codes). Also
exclude the non-characters U+FFFE and U+. But, what text encoding
are we actually talking about? 8bit suggests UTF-8 encoding or
ISO-8859-... code pages (not assuming EBCDIC ;P). I don't see that
mutt_buffer_sanitize_filename() or its calling contexts have any notion
about text encoding. Assuming UTF-8 (as path names are supposed to be),
check for valid sequences?

  Eike

-- 
OpenPGP/GnuPG encrypted mail preferred in all private communication.
GPG key 0x6A6CD5B765632D3A - 2265 D7F3 A7B0 95CC 3918  630B 6A6C D5B7 6563 2D3A
Use LibreOffice! https://www.libreoffice.org/


signature.asc
Description: PGP signature


Re: Loosening mailcap filename sanitization to allow unicode filenames

2021-04-27 Thread Cameron Simpson
On 27Apr2021 10:17, Kevin J. McCarthy  wrote:
>Ticket 351 on gitlab (https://gitlab.com/muttmua/mutt/-/issues/351) 
>noted that an attachment 中文名称.txt, when launched via a mailcap 
>viewer, created a tempfile ".txt".

Ouch.

>This is because of the sanitize_filename() functions, which have an 
>allow-list of 
>"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%/" 
>(with the '/' disabled for filenames).
>
>I'd be reluctant to change sanitization for the %{} or %t 
>expandos, but this does seem to be a bit strict for the filename. 
>Oswald notes in the ticket that 8-bit characters are harmless at the 
>system level (Oswald, feel free to reply/clarify - I'm not trying to 
>put words in your mouth).

First remark:

I think we should make clear that this only makes sense when you're 
encoding filenames as UTF-8, where all multibyte sequences have a high 
bit set. This isn't necessarily the case with other encodings.

Second remark:

As one who has long been less than enthused by sanitising filenames, 
what exactly are we trying to accomplish when we sanitise a filename?

- avoid trickiness like whitespace and quote characters, which cause a 
  little pain for users of the files in scripting settings?

- avoiding $ and ` et al, which cause hazards for the very careless 
  script author? (but inly if injected blindly)

- avoiding other shell punctuation like redirections? same issue

- avoiding escape paths such as absolute paths (/etc/passwd, oh root-run 
  mutt user?) or ../blah to get out of the scratch area?

Without qualifying these objectives, "sanitisation" means little (or too 
much, depending where you stand).

>On the one hand these are temp files, but Mutt already tries to 
>preserve the filename to make for a nicer user interaction.  It seems 
>if we can preserve unicode filenames better we ought to do that too.

"Unicode filenames" isn't a meaningful term in UNIX, as the API is C 
strings - byte sequences with NUL terminators. I suspect you mean "UTF-8 
encoded names", which is the common modern default.

>What if we added an allow_8bit parameter to the function, that also 
>passed through bytes with the 8th bit set?  I'd keep this set off in 
>all other invocations except the mailcap invocations.

Of course, the trickiness is that header things like filenames are, 
IIRC, "bytes". Without a charset, do we inherently know anything about 
them _as characters_?

I'm +1 for allow_8bit if we make it clear in the docs (and implemented 
it correctly in the code) that this refers to the in-filesystem byte 
encoding of the filename. _Not_ hypothetical "Unicode". One person's 
"Unicode" is another's Shift-JIS :-)

https://en.wikipedia.org/wiki/Mojibake

One has but to shift one's shell locale to see this play out.

Cheers,
Cameron Simpson 


Loosening mailcap filename sanitization to allow unicode filenames

2021-04-27 Thread Kevin J. McCarthy
Ticket 351 on gitlab (https://gitlab.com/muttmua/mutt/-/issues/351) 
noted that an attachment 中文名称.txt, when launched via a mailcap 
viewer, created a tempfile ".txt".


This is because of the sanitize_filename() functions, which have an 
allow-list of 
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%/" 
(with the '/' disabled for filenames).


I'd be reluctant to change sanitization for the %{} or %t 
expandos, but this does seem to be a bit strict for the filename. 
Oswald notes in the ticket that 8-bit characters are harmless at the 
system level (Oswald, feel free to reply/clarify - I'm not trying to put 
words in your mouth).


On the one hand these are temp files, but Mutt already tries to preserve 
the filename to make for a nicer user interaction.  It seems if we can 
preserve unicode filenames better we ought to do that too.


What if we added an allow_8bit parameter to the function, that also 
passed through bytes with the 8th bit set?  I'd keep this set off in all 
other invocations except the mailcap invocations.


Comments?

--
Kevin J. McCarthy
GPG Fingerprint: 8975 A9B3 3AA3 7910 385C  5308 ADEF 7684 8031 6BDA


signature.asc
Description: PGP signature