Hello.

SUMMARY:

  $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
  Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
  $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
  [1] "\xff"

BACKGROUND:

I launch R through a Son of Grid Engine (SGE) scheduler, where the R
process is launched on a compute host via 'qrsh', which is part of SGE.
Without going into details, 'mpirun' is also involved. Regardless, in
this process, a 'qrsh'-specific environment variable 'QRSH_COMMAND' is
set automatically. The value of this variable consists of a string with
\xff (byte value 255) injected between the words. This is by design of
SGE [1]. Here is an example of what this environment variable may look
like:

  QRSH_COMMAND= orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn"

where each \xff is a single byte 255=0xFF=\xFF.

ISSUE:

An environment variable with embedded 0xFF bytes in its value causes
calls to Sys.getenv() to produce an error when running R in a UTF-8
locale. Here is a minimal example on Linux:

  $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
  Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
  Calls: Sys.getenv -> substring
  In addition: Warning message:
  In regexpr("=", x, fixed = TRUE) : input string 134 is invalid in this locale
  Execution halted

WORKAROUND:

The workaround is to (1) identify any environment variables whose
values contain invalid UTF-8 byte sequences, and (2) prune or unset
those variables before launching R. For example, in my SGE case,
launching R using:

  QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()"

avoids the problem. Having to unset/modify environment variables
because R doesn't like them seems a bit of an ad-hoc hack to me.
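
Step (1) of the workaround, finding the offending variables, could be sketched in shell roughly as follows. This is only an illustration, not something SGE or R provides: it assumes 'awk', 'printenv' and 'iconv' are available on the compute host, the helper names is_valid_utf8 and list_invalid_utf8_vars are made up for this example, and variable names containing newlines are not handled.

```shell
# Hypothetical helper: reads stdin and succeeds only if iconv can
# decode the input as UTF-8 (iconv exits non-zero on invalid bytes).
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: prints the name of every environment variable
# whose value is not valid UTF-8, one name per line.
list_invalid_utf8_vars() {
  # Without iconv every variable would be reported, so bail out early.
  command -v iconv >/dev/null 2>&1 || {
    echo "iconv not found" >&2
    return 2
  }
  # awk's ENVIRON array yields the variable names without having to
  # parse "NAME=value" lines, which breaks on multi-line values.
  awk 'BEGIN { for (name in ENVIRON) print name }' |
  while IFS= read -r name; do
    if ! printenv "$name" | is_valid_utf8; then
      printf '%s\n' "$name"
    fi
  done
}

# Report the variables that would need to be cleared before starting R.
list_invalid_utf8_vars
```

Each reported name can then be cleared for the R process in the same way as QRSH_COMMAND in the example above.
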
Also, if you are not aware of this problem, or are not a savvy R user,
it can be quite tricky to track down the cause of the above error
message, especially if Sys.getenv() is called deep down in some package
dependency.

DISCUSSION/SUGGESTION/ASK:

My suggestion would be to make Sys.getenv() robust against any type of
byte values in environment variable strings. The error occurs in
Sys.getenv() from:

  x <- .Internal(Sys.getenv(character(), ""))
  m <- regexpr("=", x, fixed = TRUE)  ## produces a warning
  n <- substring(x, 1L, m - 1L)
  v <- substring(x, m + 1L)           ## produces the error

I know too little about string encodings, so I'm not sure what the best
approach would be here, but maybe falling back to the C locale when
parsing strings that are invalid in the current locale would be
reasonable? Maybe Sys.getenv() should always use the C locale for this.
It looks like Sys.getenv(name) does this, e.g.

  $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
  [1] "\xff"

I'd appreciate any comments and suggestions. I'm happy to file a bug
report on BugZilla, if this is a bug.

Henrik

[1] https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel