[Rd] Possible Bug In Validation of UTF-8 Sequences

brodie gaslam via R-devel Sat, 04 Apr 2020 12:00:21 -0700

As per `?intToUtf8`, and in the comments to `valid_utf8`[1], R 
intends to prevent illegal UTF-8 such as UTF-8 encoded
UTF-16 surrogate pairs.  `R_nchar`, invoked via `base::nchar`,
explicitly validates UTF-8 strings[2], but allows the surrogate:


    > Encoding('\ud800')
    [1] "UTF-8"
    > nchar('\ud800')  // should be an error
    [1] 1

The problem manifests on systems where `char` is signed.  The logic
used to test for the forbidden sequences implicitly assumes that
`char` is unsigned[3]:

    if (c == 0xe0 && (d & 0x20) == 0) return 1;
    if (c == 0xed && d >= 0xa0) return 1;

Notice the `d >= 0xa0`.  On a system where `char` is signed this can
only ever be true if a byte has more than 8 bits, as otherwise the
maximum value of `d` is 0x7f because `d` is retrieved from a plain
`char` pointer[4] (d is `int`):

    if (((d = *(++p)) & 0xc0) != 0x80) return 1;

Where `p` is defined as[5]:

     const char *p;

In contrast `c` above is correctly cast to `unsigned char` prior to
use[8]:

     c = (unsigned char)*p;

I attach a simple patch to address this.

I also include a patch to remove the handling of surrogates from
`R_nchar` as that should not longer be necessary, and additionally the
current handling appears incorrect.  AFAICT, the current handling
attempts to decode a surrogate pair by combining the high surrogate
with the same high surrogate, instead of the high surrogate with the
subsequent character that hopefully is the low surrogate[7].

Here is some code that could be added to regression tests:

    surr_good <- '\ud840\udc00'            # auto-converts to normal
    surr_bad <- paste0('\ud840', '\udc00') # surrogates remain
    good <- c('hello', 'world', surr_good, '\ud7ff', '\ue000', '\U0010ffff')
    bad <- c(surr_bad, '\ud800', '\udfff', '\U00110000')

On R3.6.3:

    nchar(good, allowNA=TRUE)
    [1] 5 5 1 1 1 1
    nchar(bad, allowNA=TRUE)
    [1] 2 1 1 1

On R-devel (2020-03-31 r78116) w/ patch:

    nchar(good, allowNA=TRUE)
    [1] 5 5 1 1 1 1
    nchar(bad, allowNA=TRUE)
    [1] NA NA NA NA

I ran `make check-devel` successfully, although I did have to suppress
one PCRE test[9] that segfaults on my particular set-up, though that
segfaulted prior to my patch as well.

The patch does not prevent the creation of illegal UTF-8 strings,
although I imagine it would be straightforward to add checks to the
entry points into CHARSXPs if that were desired.

Finally, this signed char business hints at a potential broader issue.
If I understand correctly importing byte sequences with values greater
than 0x7f overflows the `char` buffers on systems with signed chars
and octet (or lesser) bytes, e.g. as in `do_readLines`[6] where an
integer procured via `fgetc` is subsequently stored in a `char`
buffer.  Obviously this hasn't mattered much to date, presumably
because the implementations R runs on define the `unsigned char` to
`signed char` conversion in such a way that the `signed char` to
`unsigned char` conversion restores the original value.  I don't know
if this is something explicitly checked for like the `int` == 32bits
assumption.

[1]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L61
[2]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L148
[3]: 
https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L106
[4]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L84
[5]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L69
[6]: 
https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/connections.c#L3935
[7]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L184
[8]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L73
[9]: https://github.com/wch/r-source/blob/tags/R-3-6-3/tests/PCRE.R#L16

Index: src/main/valid_utf8.h
===================================================================
--- src/main/valid_utf8.h       (revision 78116)
+++ src/main/valid_utf8.h       (working copy)
@@ -66,11 +66,11 @@
 static int
 valid_utf8(const char *string, size_t length) // R change int->size_t
 {
-    const char *p;
+    const unsigned char *p;
 
     for (p = string; length-- > 0; p++) {
-       int ab, c, d;
-       c = (unsigned char)*p;
+       unsigned int ab, c, d;
+       c = *p;
        if (c < 128) continue;                /* ASCII character */
        if (c < 0xc0) return 1;               /* Isolated 10xx xxxx byte */
        if (c >= 0xfe) return 1;             /* Invalid 0xfe or 0xff bytes */

Index: src/main/character.c
===================================================================
--- src/main/character.c        (revision 78116)
+++ src/main/character.c        (working copy)
@@ -180,10 +180,7 @@
                int nc = 0;
                for( ; *p; p += utf8clen(*p)) {
                    utf8toucs(&wc1, p);
-                   if (IS_HIGH_SURROGATE(wc1))
-                       ucs = utf8toucs32(wc1, p);
-                   else
-                       ucs = wc1;
+                    ucs = wc1;
                    nc += Ri18n_wcwidth(ucs);
                }
                return nc;

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] Possible Bug In Validation of UTF-8 Sequences

Reply via email to