On Thu, Aug 08, 2024 at 09:54:26AM GMT, Markus Armbruster wrote: > Eric Blake <ebl...@redhat.com> writes: > > > My next patch needs to convert text from an untrusted input into an > > output representation that is suitable for display on a terminal is > > useful to more than just the json-writer; the text should normally be > > UTF-8, but blindly allowing all Unicode code points (including ASCII > > ESC) through to a terminal risks remote-code-execution attacks on some > > terminals. Extract the existing body of json-writer's quoted_strinto > > a new helper routine mod_utf8_sanitize, and generalize it to also work > > on data that is length-limited rather than NUL-terminated. > > This is two changes in one patch: factor out, and generalize. > > You don't actually use the generalized variant. Please leave that for > later, and keep this patch simple.
Makes sense. Will simplify in v2. > > > [I was > > actually surprised that glib does not have such a sanitizer already - > > Google turns up lots of examples of rolling your own string > > sanitizer.] See https://gitlab.gnome.org/GNOME/glib/-/issues/3426 where I asked if glib should provide a more generic sanitizer. In turn, I found glib does have: char* g_uri_escape_string ( const char* unescaped, const char* reserved_chars_allowed, gboolean allow_utf8 ) which is a bit more powerful (you have some fine-tuning on what gets escaped), but a different escaping mechanism (%XX instead of \uXXXX escapes, where % rather than \ is special). I'm not sure whether it is nicer to have a malloc'd result or to append into a larger g_string as the base operation (and where you could write an easy wrapper to provide the other operation). > > +/** > > + * mod_utf8_sanitize: > > + * @buf: Destination buffer > > + * @str: Modified UTF-8 string to sanitize > > + * > > + * Append the contents of the NUL-terminated Modified UTF-8 string > > + * @str into @buf, with escape sequences used for non-printable ASCII > > "Append into" or "append to"? "append to" works, will simplify. > > > + * and for quote characters. The result is therefore safe for output > > + * to a terminal. > > Actually, we escape double quote, backslash, ASCII control characters, > and non-ASCII characters, i.e. we leave just printable ASCII characters > other than '"' and '\\' unescaped. See the code quoted below. > > Escaping '\\' is of course necessary. > > Escaping '"' is necessary only for callers that want to enclose the > result in double quotes without ambiguity. Which is what the existing > caller wants. Future callers may prefer not to escape '"'. But we can > worry about this when we run into such callers. If we encounter more users, I could see this morphing into a backend that takes a flag argument on knobs like whether to escape " or ', whether to preserve printing Unicode, ...; coupled with frontends with fewer arguments that pass the right flags intot the backend. > > > + * > > + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as > > + * "\xC0\x80". > > + */ > > Missing: behavior for invalid sequences. Each such sequence is replaced > by a replacement character U+FFFD. For the definition of "invalid > sequence", see mod_utf8_codepoint(). Indeed, more documentation here wouldn't hurt (both on \ and " being escaped, and on U+FFFD codepoints being placed into the output stream). > > On the function name. "Sanitize" could be many things. This function > actually does two things: (1) replace invalid sequences, and (2) escape > to printable ASCII. What about append_mod_utf8_as_printable_ascii()? > Admittedly a mouthful. Naming is always tough. Your suggestion is longer, but indeed accurate. Maybe a shorter compromise of append_escaped_mod_utf8()? > > > +void mod_utf8_sanitize(GString *buf, const char *str) > > +{ > > + mod_utf8_sanitize_len(buf, str, -1); > > +} -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org