Re: [Rd] gsub() hex character range problems in R-devel?

2022-01-06 Thread Martin Morgan
Thanks Tomas and 'Brodie' for your expert explanation; it provides great help 
in understanding and solving my immediate problem.

Tomas' observation to 'do something like e.g. "only keep ASCII digits, ASCII 
space, ASCII underscore, but remove all other characters"' points to a basic 
weakness in the code I'm looking at. E.g., removing a non-breaking space is 
probably not appropriate ('foo\ua0bar' should probably be cleaned to 'foo bar' 
and not 'foobar'). More generally, other non-ASCII characters ('fancy' quotes, 
em-dashes, ...) would require special treatment. It seems like the right thing 
to do is to handle the raw data in its original encoding, rather than to try to 
clean it to ASCII.
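
A minimal sketch of that kind of special treatment, assuming the text is already valid UTF-8. The input string and the chosen ASCII stand-ins are hypothetical, not taken from the code under discussion:

```r
# Hypothetical input: UTF-8 text with a no-break space (U+00A0)
# and 'fancy' single quotes (U+2018, U+2019).
x <- "foo\u00a0bar \u2018baz\u2019"

# Map each special character to a chosen ASCII stand-in instead of deleting it.
y <- gsub("\u00a0", " ", x, fixed = TRUE)         # no-break space -> plain space
y <- gsub("[\u2018\u2019]", "'", y, perl = TRUE)  # fancy quotes -> apostrophe
y
```

Each mapping is an explicit decision per character, which is the point: a blanket "delete everything above 0x7f" rule cannot make that distinction.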

Martin

On 1/5/22, 4:17 AM, "Tomas Kalibera"  wrote:

Hi Martin,

I'd add a few comments to the excellent analysis of Brodie.

- \xhh is allowed and defined in Perl regular expressions, see ?regex 
(would need perl=TRUE), but to enter that in an R string, you need to 
escape the backslash.

- \xhh is not defined by POSIX for extended regular expressions, nor is 
it documented in ?regex for those; TRE supports it, but portable 
programs should still not rely on that

- literal \xhh in an R string is turned into the corresponding byte by R, 
but I would say this should not be used at all by users, because the 
result is encoding-specific

- use of \u and \U in an R string is fine, it has well defined semantics 
and the corresponding string will then be flagged UTF-8 in R (so e.g. 
\ua0 is fine to represent the Unicode no-break space)

- see caveats of using character ranges with POSIX extended regular 
expressions in ?regex re encodings; using Perl regular expressions in 
UTF-8 mode is more reliable for those

So, a variant of your example might be:

 > gsub("[\\x7f-\\xff]", "", "fo\ua0o", perl=TRUE)
[1] "foo"

(note that the \ua0 ensures that the text is UTF-8, and hence the UTF-8 
mode for regular expressions is used, ?regex has more)

However, I think it is better to formulate regular expressions to cover 
all of Unicode, so do something like e.g. "only keep ASCII digits, ASCII 
space, ASCII underscore, but remove all other characters".

Best
Tomas

On 1/4/22 8:35 PM, Martin Morgan wrote:

> I'm not very good at character encoding / etc so this might be user
> error. The following code is meant to replace extended ASCII characters, in
> particular a non-breaking space, with "", and it works in R-4-1-branch
>
>> R.version.string
> [1] "R version 4.1.2 Patched (2022-01-04 r81445)"
>> gsub("[\x7f-\xff]", "", "fo\xa0o")
> [1] "foo"
>
> but fails in R-devel
>
>> R.version.string
> [1] "R Under development (unstable) (2022-01-04 r81445)"
>> gsub("[\x7f-\xff]", "", "fo\xa0o")
> Error in gsub("[\177-\xff]", "", "fo\xa0o") : invalid regular expression
> '[-�]', reason 'Invalid character range'
> In addition: Warning message:
> In gsub("[\177-\xff]", "", "fo\xa0o") :
>   TRE pattern compilation error 'Invalid character range'
>
> There are other oddities, too, like
>
>> gsub("[[:alnum:]]", "", "fo\xa0o")  # R-4-1-branch
> [1] "\xfc\xbe\x8c\x86\x84\xbc"
>
>> gsub("[[:alnum:]]", "", "fo\xa0o")  # R-devel
> [1] "<>"
>
> The R-devel sessionInfo is
>
>> sessionInfo()
> R Under development (unstable) (2022-01-04 r81445)
> Platform: x86_64-apple-darwin19.6.0 (64-bit)
> Running under: macOS Catalina 10.15.7
>
> Matrix products: default
> BLAS:   /Users/ma38727/bin/R-devel/lib/libRblas.dylib
> LAPACK: /Users/ma38727/bin/R-devel/lib/libRlapack.dylib
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.2.0
>
> (I have built my own R on macOS; similar behavior is observed on a Linux
> machine)
>
> Any hints welcome,
>
> Martin Morgan
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] gsub() hex character range problems in R-devel?

2022-01-05 Thread Tomas Kalibera

Hi Martin,

I'd add a few comments to the excellent analysis of Brodie.

- \xhh is allowed and defined in Perl regular expressions, see ?regex 
(would need perl=TRUE), but to enter that in an R string, you need to 
escape the backslash.


- \xhh is not defined by POSIX for extended regular expressions, nor is 
it documented in ?regex for those; TRE supports it, but portable 
programs should still not rely on that


- literal \xhh in an R string is turned into the corresponding byte by R, 
but I would say this should not be used at all by users, because the 
result is encoding-specific


- use of \u and \U in an R string is fine, it has well defined semantics 
and the corresponding string will then be flagged UTF-8 in R (so e.g. 
\ua0 is fine to represent the Unicode no-break space)


- see caveats of using character ranges with POSIX extended regular 
expressions in ?regex re encodings; using Perl regular expressions in 
UTF-8 mode is more reliable for those


So, a variant of your example might be:

> gsub("[\\x7f-\\xff]", "", "fo\ua0o", perl=TRUE)
[1] "foo"

(note that the \ua0 ensures that the text is UTF-8, and hence the UTF-8 
mode for regular expressions is used; ?regex has more)
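
A small sketch that makes the two points above visible: the \u escape marks the string as UTF-8, and the doubled backslashes mean the regex engine (not the R parser) receives the \x7f and \xff escapes. The variable names are only for illustration:

```r
x <- "fo\ua0o"
enc <- Encoding(x)          # "UTF-8": the \u escape marks the string

# With escaped backslashes the pattern holds the literal text [\x7f-\xff],
# which PCRE then reads as a range of code points.
pat <- "[\\x7f-\\xff]"
n <- nchar(pat)             # 11 characters, not 5

res <- gsub(pat, "", x, perl = TRUE)
res                         # "foo", as in the example above
```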


However, I think it is better to formulate regular expressions to cover 
all of Unicode, so do something like e.g. "only keep ASCII digits, ASCII 
space, ASCII underscore, but remove all other characters".
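
As a sketch of such a "keep only" formulation, assuming UTF-8 input and Perl regular expressions (the input string is made up for illustration):

```r
# Hypothetical input: ASCII letters, digits, space, underscore,
# plus a no-break space (U+00A0).
x <- "a1 b_2\ua0c"

# Negated class: remove every character that is NOT an ASCII digit,
# an ASCII space or an ASCII underscore.
kept <- gsub("[^0-9 _]", "", x, perl = TRUE)
kept
```

Because the class is stated in terms of what to keep, it behaves the same for any non-ASCII character in the input, not only those up to U+00FF.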


Best
Tomas

On 1/4/22 8:35 PM, Martin Morgan wrote:


> I'm not very good at character encoding / etc so this might be user error. The following
> code is meant to replace extended ASCII characters, in particular a non-breaking space,
> with "", and it works in R-4-1-branch
>
>> R.version.string
> [1] "R version 4.1.2 Patched (2022-01-04 r81445)"
>> gsub("[\x7f-\xff]", "", "fo\xa0o")
> [1] "foo"
>
> but fails in R-devel
>
>> R.version.string
> [1] "R Under development (unstable) (2022-01-04 r81445)"
>> gsub("[\x7f-\xff]", "", "fo\xa0o")
> Error in gsub("[\177-\xff]", "", "fo\xa0o") : invalid regular expression
> '[-�]', reason 'Invalid character range'
> In addition: Warning message:
> In gsub("[\177-\xff]", "", "fo\xa0o") :
>   TRE pattern compilation error 'Invalid character range'
>
> There are other oddities, too, like
>
>> gsub("[[:alnum:]]", "", "fo\xa0o")  # R-4-1-branch
> [1] "\xfc\xbe\x8c\x86\x84\xbc"
>
>> gsub("[[:alnum:]]", "", "fo\xa0o")  # R-devel
> [1] "<>"
>
> The R-devel sessionInfo is
>
>> sessionInfo()
> R Under development (unstable) (2022-01-04 r81445)
> Platform: x86_64-apple-darwin19.6.0 (64-bit)
> Running under: macOS Catalina 10.15.7
>
> Matrix products: default
> BLAS:   /Users/ma38727/bin/R-devel/lib/libRblas.dylib
> LAPACK: /Users/ma38727/bin/R-devel/lib/libRlapack.dylib
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.2.0
>
> (I have built my own R on macOS; similar behavior is observed on a Linux
> machine)
>
> Any hints welcome,
>
> Martin Morgan


Re: [Rd] gsub() hex character range problems in R-devel?

2022-01-04 Thread Brodie Gaslam via R-devel
> On Tuesday, January 4, 2022, 02:35:50 PM EST, Martin Morgan 
>  wrote:
>
> I'm not very good at character encoding / etc so this might be user
> error. The following code is meant to replace extended ASCII characters,
> in particular a non-breaking space, with "", and it works in
> R-4-1-branch

Martin,

I'm (obviously) not R-Core, so you should take whatever I say with a grain
of salt.  Nonetheless I have run into a similar issue as you, and my
assessment is that the behavior in R-4-1-2 is due to a bug that was fixed
with -r81103 for R-devel only.  It only appears more correct due to
happenstance and "surprising" (at least to me) behavior from the
"corrected" code.

But before I get into the details, I'd be remiss not to add some warnings
about using arbitrary bytes in strings as you do here.  The strings in
your examples are not marked:

    Encoding("fo\xa0o")
    [1] "unknown"

This means internals may interpret them as being in native encoding (UTF-8
in your case, in which your string is invalid).  If you want to use byte
operations you should mark your strings as "bytes" / use the "useBytes"
parameter to the functions in question (and assume all the consequences of
generating invalid encodings), or even better translate the string from its
actual encoding to your encoding.  For your case assuming you have
ISO-8859-1 encoding (I'm just guessing) I would do:

    x <- "fo\xa0o"
    y <- iconv(x, "ISO-8859-1", "UTF-8")
    gsub("\ua0", "", y)
    [1] "foo"

You could also just have marked your string as "latin1" as for 0xA0 it is
the same as ISO-8859-1 and gotten the same result without `iconv`, but the
`iconv` solution is more general.
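
A sketch of the "mark as latin1" alternative, under the same guess that the byte 0xA0 really came from ISO-8859-1 / latin1 data. `validUTF8()` is used only to show that the unmarked string is not valid UTF-8:

```r
x <- "fo\xa0o"
valid <- validUTF8(x)        # FALSE: a lone 0xA0 byte is not valid UTF-8

# Declare the actual encoding instead of converting with iconv().
y <- x
Encoding(y) <- "latin1"

# Once the encoding is declared, R can translate the string correctly.
same <- identical(enc2utf8(y), "fo\ua0o")

res <- gsub("\ua0", "", y)
res
```

Marking only changes how R interprets the existing bytes, so it is appropriate only when the declared encoding is the true one.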

I'll address the two examples in reverse order as the first one
is more obvious.

> > gsub("[[:alnum:]]", "", "fo\xa0o")  # R-4-1-branch
> [1] "\xfc\xbe\x8c\x86\x84\xbc"
>
> > gsub("[[:alnum:]]", "", "fo\xa0o")  # R-devel
> [1] "<>"

The result in R-4-1 contains bytes not present in the input.  Clearly
this cannot be correct.  R-devel is "correct" if you account for the
surprising (to me) behavior that invalid bytes in UTF-8 interpreted
strings may be escaped in pre-processing.  This is roughly what's
happening:

    "fo\xa0o" -> "fo<a0>o" -> gsub("[[:alnum:]]", "", "fo<a0>o") -> "<>"

Where "<a0>" is the escaped version of the "\xa0".  It's clearer if you do
(R-devel):

    gsub("f", "", "fo\xa0o")
    [1] "o<a0>o"

I do think this "correct" behavior would be better as an error or at a
minimum a warning, and hopefully this is something that will change in the
future.

> > R.version.string
> [1] "R version 4.1.2 Patched (2022-01-04 r81445)"
> > gsub("[\x7f-\xff]", "", "fo\xa0o")
> [1] "foo"
>
> but fails in R-devel
> > R.version.string
> [1] "R Under development (unstable) (2022-01-04 r81445)"
> > gsub("[\x7f-\xff]", "", "fo\xa0o")
> Error in gsub("[\177-\xff]", "", "fo\xa0o") : invalid regular expression 
> '[-�]', reason 'Invalid character range'
> In addition: Warning message:
> In gsub("[\177-\xff]", "", "fo\xa0o") :
>   TRE pattern compilation error 'Invalid character range'

This one is pretty interesting.  The same bug persists, but because it
affects both the pattern and the string to manipulate the bugs cancel out.
If you look at what's happening internally in R-4-1, the range "\x7f-\xff"
is translated to "\u7f-\U{3e66663c}", but "fo\xa0o" is also translated to
"fo\U{3e30613c}o", so it happens to work.

Why "\U{3e66663c}"?  Well, it's really 3e 66 66 3c, i.e. the bytes
3c 66 66 3e in little-endian storage order, which the code intended to
have interpreted as < f f >.  In ASCII encoding, we have 3c = <, 66 = f,
3e = >.  So the intent was to write out "<ff>", the 4 character
escape for the single byte "\xff".  Instead, the 4 bytes are written into
a single wchar_t (on systems with 32bit wchar_t) and interpreted as that
code point.
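
The arithmetic can be reproduced in R. This is only a model of the packing described above, not the internal code; `pack_le` is a hypothetical helper that reads the four bytes of an escape such as "<a0>" as one little-endian 32-bit integer:

```r
# Hypothetical helper: read the 4 bytes of a string as one little-endian
# 32-bit integer and return its value in hex.
pack_le <- function(s) {
  b <- charToRaw(s)
  stopifnot(length(b) == 4L)
  v <- readBin(b, what = "integer", size = 4L, endian = "little")
  format(as.hexmode(v))
}

pack_le("<a0>")   # "3e30613c", the code point quoted for "fo\xa0o"
pack_le("<ff>")   # "3e66663c"
```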

On little-endian machines like ours, the double cancellation does not
always work, as seen in R-4-1-2:

    gsub("[\x7f-\xab]", "",  "\xab")
    ## [1] ""
    gsub("[\x7f-\xba]", "",  "\xab")  # changed end to be \xba
    ## [1] "\xab"

One would expect the second range to still capture the character, but
because wchar_t is interpreted little endian the order of the "a" and "b"
written into the wchar_t is opposite of what is desired.  So it would not
be possible to leave the bug in (even if it didn't cause other issues) on
the grounds it cancels itself out.
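
The ordering problem can be modelled with the same packing idea. This is a sketch of the explanation, not of R's internals; `pack_le_int` and `in_range` are hypothetical helpers, and the escapes "<ab>" and "<ba>" stand for the bytes "\xab" and "\xba":

```r
# Hypothetical helper: 4 bytes read as one little-endian 32-bit integer.
pack_le_int <- function(s) {
  readBin(charToRaw(s), what = "integer", size = 4L, endian = "little")
}

ab <- pack_le_int("<ab>")   # 0x3e62613c
ba <- pack_le_int("<ba>")   # 0x3e61623c

# The second hex digit ends up in the more significant byte,
# so the packed "ab" compares greater than the packed "ba".
ab > ba                     # TRUE

# Model of the range test: is the packed "\xab" inside [0x7f, upper]?
in_range <- function(v, upper) v >= 0x7f && v <= upper
in_range(ab, ab)            # TRUE:  "[\x7f-\xab]" matches "\xab"
in_range(ab, ba)            # FALSE: "[\x7f-\xba]" misses "\xab"
```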

With the patch applied in R-devel, the range "[\x7f-\xff]" becomes
"[\x7f-<ff>]", which is invalid because "<" has a lower code point than
"\x7f".  Here the fix exposes the "surprisingness" of the current
behavior.

Although again, you can currently side-step all this simply by
converting everything into valid encodings and avoiding bytes
manipulation, or doing everything very carefully explicitly with "bytes"
marked strings and "useBytes=TRUE".
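
For the byte-oriented route, a minimal sketch using a fixed (non-regex) pattern together with useBytes=TRUE. It assumes you really do want to drop the raw byte 0xA0 regardless of encoding, and it inspects results with `charToRaw()` to avoid printing invalid strings:

```r
x <- "fo\xa0o"
charToRaw(x)                 # 66 6f a0 6f

# Byte-wise, literal matching: no encoding interpretation takes place.
res <- gsub("\xa0", "", x, fixed = TRUE, useBytes = TRUE)
charToRaw(res)               # 66 6f 6f
```

If the input were actually UTF-8, removing 0xA0 bytes this way could cut multi-byte characters in half, which is one of the consequences of byte manipulation mentioned above.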

Best,

B.

> The R-devel sessionInfo is
>
> > sessionInfo()
> R Under development (unstable) (2022-01-04 r81445)
> Platform: x86_64-apple-darwin19.6.0 (64-bit)
> Running under: macOS Catalina 10.15.7
>
> Matrix