> On Tuesday, January 4, 2022, 02:35:50 PM EST, Martin Morgan
> <mtmorgan.b...@gmail.com> wrote:
>
> I'm not very good at character encoding / etc so this might be user
> error. The following code is meant to replace extended ASCII characters,
> in particular a non-breaking space, with "", and it works in
> R-4-1-branch

Martin,

I'm (obviously) not R-Core, so you should take whatever I say with a
grain of salt. Nonetheless, I have run into a similar issue, and my
assessment is that the behavior in R-4-1-2 is due to a bug that was
fixed with -r81103 for R-devel only. It only appears more correct due
to happenstance and "surprising" (at least to me) behavior from the
"corrected" code.

But before I get into the details, I'd be remiss not to add some
warnings about using arbitrary bytes in strings as you do here. The
strings in your examples are not marked:

    Encoding("fo\xa0o")
    [1] "unknown"

This means internals may interpret them as being in the native encoding
(UTF-8 in your case, in which your string is invalid). If you want to
use byte operations, you should mark your strings as "bytes" / use the
"useBytes" parameter of the functions in question (and accept all the
consequences of generating invalid encodings), or even better,
translate the string from its actual encoding to your encoding. For
your case, assuming you have ISO-8859-1 encoding (I'm just guessing), I
would do:

    x <- "fo\xa0o"
    y <- iconv(x, "ISO-8859-1", "UTF-8")
    gsub("\ua0", "", y)
    [1] "foo"

You could also just have marked your string as "latin1", since for 0xA0
it is the same as ISO-8859-1, and gotten the same result without
`iconv`, but the `iconv` solution is more general.

I'll address the two examples in reverse order, as the second one is
the more obvious.

> > gsub("[[:alnum:]]", "", "fo\xa0o") # R-4-1-branch
> [1] "\xfc\xbe\x8c\x86\x84\xbc"
>
> > gsub("[[:alnum:]]", "", "fo\xa0o") # R-devel
> [1] "<>"

The result in R-4-1 contains bytes not present in the input. Clearly
this cannot be correct. R-devel is "correct" if you account for the
surprising (to me) behavior that invalid bytes in strings interpreted
as UTF-8 may be escaped in pre-processing. This is roughly what's
happening:

    "fo\xa0o" -> "fo<a0>o" -> gsub("[[:alnum:]]", "", "fo<a0>o") -> "<>"

where "<a0>" is the escaped version of "\xa0".
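
As an aside, the chain above can be mimicked in a released R by writing
the escape out explicitly. This is only a sketch of the effect, not the
code path inside gsub(): it assumes that iconv() with sub = "byte",
which is documented to replace non-convertible bytes with "<xx>", is a
fair stand-in for the internal pre-processing.

```r
x <- "fo\xa0o"

# Stand-in for the internal escaping (an assumption, not gsub()'s own code):
# sub = "byte" replaces each non-convertible byte with its "<xx>" hex escape.
escaped <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
escaped

# The regex then only ever sees plain ASCII, and the "a0" inside the escape
# is alphanumeric too, so only the angle brackets survive.
result <- gsub("[[:alnum:]]", "", "fo<a0>o")
result
```

The second call works on a hand-written "fo<a0>o", so it does not depend
on how any particular iconv implementation treats the invalid byte.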

It's clearer if you do (R-devel):

    gsub("f", "", "fo\xa0o")
    [1] "o<a0>o"

I do think this "correct" behavior would be better as an error, or at a
minimum a warning, and hopefully this is something that will change in
the future.

> > R.version.string
> [1] "R version 4.1.2 Patched (2022-01-04 r81445)"
> > gsub("[\x7f-\xff]", "", "fo\xa0o")
> [1] "foo"
>
> but fails in R-devel
>
> > R.version.string
> [1] "R Under development (unstable) (2022-01-04 r81445)"
> > gsub("[\x7f-\xff]", "", "fo\xa0o")
> Error in gsub("[\177-\xff]", "", "fo\xa0o") : invalid regular expression
> '[-�]', reason 'Invalid character range'
> In addition: Warning message:
> In gsub("[\177-\xff]", "", "fo\xa0o") :
> TRE pattern compilation error 'Invalid character range'

This one is pretty interesting. The same bug is at work in R-4-1, but
because it affects both the pattern and the string to manipulate, the
two errors cancel out. If you look at what's happening internally in
R-4-1, the range "\x7f-\xff" is translated to "\u7f-\U{3e66663c}", but
"fo\xa0o" is also translated to "fo\U{3e30613c}o", so it happens to
work.

Why "\U{3e66663c}"? Well, it's really the four bytes 3c 66 66 3e, read
back as a single little-endian 32-bit value. In ASCII, 3c = <, 66 = f,
3e = >. So the intent was to write out "<ff>", the 4 character escape
for the single byte "\xff". Instead, the 4 bytes are written into a
single wchar_t (on systems with 32-bit wchar_t) and interpreted as that
code point. On little-endian machines like ours, the double cancellation
does not always work, as seen in R-4-1-2:

    gsub("[\x7f-\xab]", "", "\xab")
    ## [1] ""
    gsub("[\x7f-\xba]", "", "\xab") # changed end to be \xba
    ## [1] "\xab"

One would expect the second range to still capture the character, but
because the wchar_t is interpreted as little-endian, the order of the
"a" and "b" written into it is the opposite of what is desired, so the
escape for "\xba" compares lower than the one for "\xab". So it would
not be possible to leave the bug in (even if it didn't cause other
issues) on the grounds that it cancels itself out.
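
The arithmetic behind those code points can be reproduced from R itself.
The sketch below packs each 4-byte escape into one 32-bit little-endian
integer with readBin(). That is my assumption about what the buggy write
amounted to, not a copy of the internal code, and
`escape_as_code_point` is a hypothetical helper name.

```r
# Pack the 4-byte escape "<xx>" into a single 32-bit little-endian integer,
# mimicking (as an assumption) four chars landing in one 32-bit wchar_t.
escape_as_code_point <- function(byte_hex) {
  esc <- charToRaw(sprintf("<%s>", byte_hex))   # e.g. "<ff>" is 3c 66 66 3e
  readBin(esc, what = "integer", size = 4, endian = "little")
}

cp <- vapply(c(a0 = "a0", ab = "ab", ba = "ba", ff = "ff"),
             escape_as_code_point, integer(1))

sprintf("%x", cp)
# "3e30613c" "3e62613c" "3e61623c" "3e66663c"

# The escape for \xba sorts *below* the one for \xab.
cp[["ba"]] < cp[["ab"]]
# TRUE
```

The values for a0 and ff match the code points quoted above, and the
last comparison is why the range ending in \xba misses "\xab".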

With the patch applied in R-devel, the range "[\x7f-\xff]" becomes
"[\x7f-<ff>]", which is invalid because "<" has a lower code point than
"\x7f". Here the fix exposes the "surprisingness" of the current
behavior.

Although again, you can currently side-step all this simply by
converting everything into valid encodings and avoiding byte
manipulation, or by doing everything very carefully and explicitly with
"bytes" marked strings and "useBytes=TRUE".

Best,

B.

> The R-devel sessionInfo is
>
> > sessionInfo()
> R Under development (unstable) (2022-01-04 r81445)
> Platform: x86_64-apple-darwin19.6.0 (64-bit)
> Running under: macOS Catalina 10.15.7
>
> Matrix products: default
> BLAS: /Users/ma38727/bin/R-devel/lib/libRblas.dylib
> LAPACK: /Users/ma38727/bin/R-devel/lib/libRlapack.dylib
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.2.0
>
> (I have built my own R on macOS; similar behavior is observed on a Linux
> machine)
>
> Any hints welcome,
>
> Martin Morgan
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel