On Sun, Nov 30, 2025, at 13:10, [email protected] wrote:
> Keeping the ASCII-only restriction for code is important as it makes
> the code easier to understand by a wider audience.
>
> Allowing non-ASCII characters in literal strings, raw or regular, does
> seem reasonable to me in principle, but others may see issues I am not
> aware of.
>
> But checking for non-ASCII characters in code while allowing non-ASCII
> characters in string literals needs much more sophisticated check code
> than we currently have. If you or anyone else want to see this happen
> you can explore creating a patch and submit to bugzilla for
> consideration.
Fair enough. It might be easier than you suspect, though, since the parser
already does the heavy lifting; code below.
(i) If the file doesn't even parse, that's a more serious problem!
(ii) If the file does parse OK, then AFAICS the only places that non-ASCII
characters might be lurking are: (a) in comments, where they are somewhat
grudgingly allowed IIRC; (b) in string literals, where we would like to allow
them; and of course (c) in symbols (variable names; see notes below), where we
DON'T want them if it's a package. And this can all be checked easily from
$parseData. My specimen function below does it in ~20 lines of "real" code.
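To make the token classes concrete, here is a minimal sketch of what the parse data looks like for a tiny all-ASCII snippet. It uses only the exported utils::getParseData(); the snippet text and the variable names (src, pd, leaves) are made up for illustration.

```r
# Parse a small snippet, keeping source references so parse data is recorded.
src <- c(
  "x <- 'a string'",
  "`odd name` <- 2  # a comment",
  "f(x)"
)
pd <- utils::getParseData(parse(text = src, keep.source = TRUE))

# Terminal ("leaf") rows carry the token class and the exact source text.
leaves <- pd[pd$terminal, c("token", "text")]
print(leaves, row.names = FALSE)

# Every identifier-like token has a class containing "SYMBOL"
# (SYMBOL, SYMBOL_FUNCTION_CALL, ...); strings are STR_CONST and
# comments are COMMENT, so the three cases (a)-(c) are easy to separate.
symbol_text <- leaves$text[grepl("SYMBOL", leaves$token, fixed = TRUE)]
print(symbol_text)
```

Note that a backticked name keeps its backticks in the text column, which is what makes it possible to treat such names separately.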
A couple of notes:
#1 I didn't realize that it is even possible to have a "normal" (i.e.
non-backticked) variable name with non-ASCII letters (see ?Quotes, "Names and
Identifiers"). And indeed I can run the following in my (Anglo) Windows RGUI:
français <- 'bon'
Crikey, that's actually scary... Anyway, the intention is clearly to NOT allow
that in package code, at least not yet.
#2 Should packages nevertheless be allowed to use backticked identifiers
containing non-ASCII characters? (IME backticks are often used for funny names
with all-ASCII characters but in the wrong places.) Personally I'd vote no, but
it's well above my pay grade, and there's no voting in R. Anyhow, my code
below has an option to check/not-check backticked symbols.
Is this likely to be acceptable? If so I'll try to submit a formal patch.
cheers
Mark
## My function:
check_ASCII_code_MVB <- function(
    file, pp= NULL, check_backticks= FALSE
){
  # Checks that any non-ASCII UTF-8 characters are confined to
  # string-literals & comments
  # Can directly supply results of previous parse(), for speed
  if( is.null( pp)){ # ... or, if not:
    pp <- try( parse( file=file, keep.source=TRUE, encoding='UTF-8'))
    if( inherits( pp, 'try-error')){
      warning( "Can't even parse, let alone check for non-ASCII")
      return( FALSE)
    }
  }
  # Get tokens of "leaf" (terminal) elements, and associated text
  # This mimics utils::getParseData()
  ppd <- pp |> attr( 'wholeSrcref') |> attr( 'srcfile') |>
    _$parseData |> attributes() |> _[ c( 'tokens', 'text')]
  symbols <- with( ppd,
    text[ grepl( 'SYMBOL', tokens, fixed=TRUE)])
  if( !check_backticks){
    # Not obvious whether to allow UTF-8 in backticked names
    # AFAICS backticks can only occur both at start and end of a parsable symbol
    backy <- startsWith( symbols, r"{`}") & endsWith( symbols, r"{`}")
    symbols <- symbols[ !backy]
  }
  non_ASCII <- .Call( tools:::C_nonASCII, symbols)
  OK <- !any( non_ASCII)
  if( !OK){
    attr( OK, 'offending_symbols') <- unique( symbols[ non_ASCII])
  }
  return( OK)
}
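Outside of 'tools' itself, the call to the unexported C_nonASCII could be swapped for something built from exported functions only. The following is a rough sketch of one such substitute, not what 'tools' does internally; has_non_ascii is a hypothetical name. It flags any string containing a byte above 0x7F.

```r
# Hypothetical helper (not part of 'tools'): TRUE for each string that
# contains at least one byte above 0x7F, i.e. at least one non-ASCII character.
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) any(charToRaw(s) > as.raw(0x7f)),
    logical(1),
    USE.NAMES = FALSE
  )
}

# "\u00e7" is c-cedilla, written as an escape so this file stays ASCII.
flags <- has_non_ascii(c("bon", "fran\u00e7ais", "`odd name`"))
print(flags)
```

Working on raw bytes sidesteps any dependence on the session locale, at the cost of not saying which character is the offender.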
## A snippet to save into a file, for testing. Note the raw string: irrelevant,
## but useful.
nonASCII_R <- r"--{
français <- 'bon'
`français` <- 'bon'
lingo <- "français"
# Nothing wrong with a bit of français in comments
}--" |> strsplit( '\n') |> _[[1]]
writeLines( nonASCII_R, <file of your choice>)
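For a quick trial that does not need a hand-picked file name, the same four lines can be written to a temporary file and scanned. This is only a sketch under some assumptions: the session runs in a UTF-8 locale (otherwise the first line may not parse at all), and find_non_ascii_symbols is a hypothetical stand-in built on utils::getParseData(), not the function above.

```r
# Hypothetical stand-in: distinct non-ASCII symbols found in a file of R code.
find_non_ascii_symbols <- function(file, check_backticks = FALSE) {
  exprs <- parse(file = file, keep.source = TRUE, encoding = "UTF-8")
  pd <- utils::getParseData(exprs)
  symbols <- pd$text[pd$terminal & grepl("SYMBOL", pd$token, fixed = TRUE)]
  if (!check_backticks) {
    backticked <- startsWith(symbols, "`") & endsWith(symbols, "`")
    symbols <- symbols[!backticked]
  }
  # A byte above 0x7F means a non-ASCII character.
  non_ascii <- vapply(
    symbols,
    function(s) any(charToRaw(s) > as.raw(0x7f)),
    logical(1),
    USE.NAMES = FALSE
  )
  unique(symbols[non_ascii])
}

# Same content as the test snippet, with c-cedilla written as an escape.
test_lines <- c(
  "fran\u00e7ais <- 'bon'",
  "`fran\u00e7ais` <- 'bon'",
  "lingo <- \"fran\u00e7ais\"",
  "# Nothing wrong with a bit of fran\u00e7ais in comments"
)
tmp <- tempfile(fileext = ".R")
# useBytes = TRUE writes the UTF-8 bytes as they are, without re-encoding.
writeLines(enc2utf8(test_lines), tmp, useBytes = TRUE)

if (isTRUE(l10n_info()[["UTF-8"]])) {
  plain_only <- find_non_ascii_symbols(tmp)
  with_backticks <- find_non_ascii_symbols(tmp, check_backticks = TRUE)
  print(plain_only)
  print(with_backticks)
}
unlink(tmp)
```

The string literal and the comment should not be reported in either call; only the two assignment targets are symbols.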
## Possible patch of tools::.check_package_ASCII_code :
.check_package_ASCII_code_patch <- function (
    dir, respect_quotes = FALSE
){
    if (!dir.exists(dir))
        stop(gettextf("directory '%s' does not exist", dir),
             domain = NA)
    dir <- file_path_as_absolute(dir)
    wrong_things <- character()
    for (f in c(file.path(dir, "NAMESPACE"),
                list_files_with_type(file.path(dir,
                    "R"), "code", OS_subdirs = c("unix", "windows")))) {
        ## OLD
        #text <- readLines(f, warn = FALSE)
        # if (.Call(C_check_nonASCII, text, respect_quotes))
        ## NEW
        if( !check_ASCII_code_MVB( f))
            wrong_things <- c(wrong_things, f)
    }
    if (length(wrong_things)) {
        wrong_things <- substring(wrong_things, nchar(dir) +
            2L)
        cat(wrong_things, sep = "\n")
    }
    invisible(wrong_things)
}
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel