As per `?intToUtf8`, and in the comments to `valid_utf8`[1], R intends to prevent illegal UTF-8 such as UTF-8 encoded UTF-16 surrogate pairs. `R_nchar`, invoked via `base::nchar`, explicitly validates UTF-8 strings[2], but allows the surrogate:
> Encoding('\ud800') [1] "UTF-8" > nchar('\ud800') // should be an error [1] 1 The problem manifests on systems where `char` is signed. The logic used to test for the forbidden sequences implicitly assumes that `char` is unsigned[3]: if (c == 0xe0 && (d & 0x20) == 0) return 1; if (c == 0xed && d >= 0xa0) return 1; Notice the `d >= 0xa0`. On a system where `char` is signed this can only ever be true if a byte has more than 8 bits, as otherwise the maximum value of `d` is 0x7f because `d` is retrieved from a plain `char` pointer[4] (d is `int`): if (((d = *(++p)) & 0xc0) != 0x80) return 1; Where `p` is defined as[5]: const char *p; In contrast `c` above is correctly cast to `unsigned char` prior to use[8]: c = (unsigned char)*p; I attach a simple patch to address this. I also include a patch to remove the handling of surrogates from `R_nchar` as that should not longer be necessary, and additionally the current handling appears incorrect. AFAICT, the current handling attempts to decode a surrogate pair by combining the high surrogate with the same high surrogate, instead of the high surrogate with the subsequent character that hopefully is the low surrogate[7]. Here is some code that could be added to regression tests: surr_good <- '\ud840\udc00' # auto-converts to normal surr_bad <- paste0('\ud840', '\udc00') # surrogates remain good <- c('hello', 'world', surr_good, '\ud7ff', '\ue000', '\U0010ffff') bad <- c(surr_bad, '\ud800', '\udfff', '\U00110000') On R3.6.3: nchar(good, allowNA=TRUE) [1] 5 5 1 1 1 1 nchar(bad, allowNA=TRUE) [1] 2 1 1 1 On R-devel (2020-03-31 r78116) w/ patch: nchar(good, allowNA=TRUE) [1] 5 5 1 1 1 1 nchar(bad, allowNA=TRUE) [1] NA NA NA NA I ran `make check-devel` successfully, although I did have to suppress one PCRE test[9] that segfaults on my particular set-up, though that segfaulted prior to my patch as well. The patch does not prevent the creation of illegal UTF-8 strings, although I imagine it would be straightforward to add checks to the entry points into CHARSXPs if that were desired. Finally, this signed char business hints at a potential broader issue. If I understand correctly importing byte sequences with values greater than 0x7f overflows the `char` buffers on systems with signed chars and octet (or lesser) bytes, e.g. as in `do_readLines`[6] where an integer procured via `fgetc` is subsequently stored in a `char` buffer. Obviously this hasn't mattered much to date, presumably because the implementations R runs on define the `unsigned char` to `signed char` conversion in such a way that the `signed char` to `unsigned char` conversion restores the original value. I don't know if this is something explicitly checked for like the `int` == 32bits assumption. [1]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L61 [2]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L148 [3]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L106 [4]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L84 [5]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L69 [6]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/connections.c#L3935 [7]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L184 [8]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L73 [9]: https://github.com/wch/r-source/blob/tags/R-3-6-3/tests/PCRE.R#L16
Index: src/main/valid_utf8.h =================================================================== --- src/main/valid_utf8.h (revision 78116) +++ src/main/valid_utf8.h (working copy) @@ -66,11 +66,11 @@ static int valid_utf8(const char *string, size_t length) // R change int->size_t { - const char *p; + const unsigned char *p; for (p = string; length-- > 0; p++) { - int ab, c, d; - c = (unsigned char)*p; + unsigned int ab, c, d; + c = *p; if (c < 128) continue; /* ASCII character */ if (c < 0xc0) return 1; /* Isolated 10xx xxxx byte */ if (c >= 0xfe) return 1; /* Invalid 0xfe or 0xff bytes */
Index: src/main/character.c =================================================================== --- src/main/character.c (revision 78116) +++ src/main/character.c (working copy) @@ -180,10 +180,7 @@ int nc = 0; for( ; *p; p += utf8clen(*p)) { utf8toucs(&wc1, p); - if (IS_HIGH_SURROGATE(wc1)) - ucs = utf8toucs32(wc1, p); - else - ucs = wc1; + ucs = wc1; nc += Ri18n_wcwidth(ucs); } return nc;
______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel