On 1/31/23 14:37, peter dalgaard wrote:

On 31 Jan 2023, at 12:51 , Tomas Kalibera <tomas.kalib...@gmail.com> wrote:


On 1/31/23 11:50, Martin Maechler wrote:
<snippage>
hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
but of course you are right that type stability is really a
valid goal....

In general, what about behaving close to "old R" and replace all such
strings by  NA_character_  (and typically raising one warning)?
This would keep the result a valid character vector, just with some NA entries.

Specifically for  Sys.getenv(),  I still think Simon has a very
valid point of "requiring" (of our design) that
Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
Sys.getenv("BOOM")

Also, as a typical R user, I'd definitely want to be able to get all the valid
environment variables, even if there are one or more invalid
ones. ... and similarly in other cases, it may be a cheap
strategy to replace invalid strings ("string" in the sense of
length 1 STRSXP, i.e., in R, a "character" of length 1) by
NA_character_  and keep all valid parts of the character vector
in a valid encoding.
Specifically for getenv(), yes, we could return NA for variables containing 
invalid strings, both when obtaining the value of a single variable and when 
obtaining all of them. That would partially match the undocumented and 
unintentional behavior R had before 4.1, and would make getenv(var) and 
getenv()[[var]] consistent even with invalid strings.  Once we decide on how 
to deal with invalid strings in 
general, we can change this again accordingly, breaking code for people who 
depend on these things (but so far I only know about this one case). Perhaps 
this would be a logical alternative to the Python approach that would be more 
suitable for R (given we have NAs and given that we happened to have that 
somewhat similar alternative before). Conceptually it is about the same thing 
as omitting the variable in Python: R users would not be able to use such 
variables, but if they don't touch them, they could be inherited by child 
processes, etc.
<more snippage>
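
A minimal sketch of the NA-replacement idea discussed above, not of any actual change to Sys.getenv(). The helper name replace_invalid_with_na() is hypothetical, and validUTF8() is used as the validity check on the assumption that the native encoding is UTF-8:

```r
# Hypothetical helper: replace invalid strings by NA_character_ and
# raise a single warning, keeping the result a plain character vector.
replace_invalid_with_na <- function(x) {
  # Assumption: "valid" means valid UTF-8 (the typical native encoding).
  bad <- !is.na(x) & !validUTF8(x)
  if (any(bad)) {
    warning(sprintf("%d invalid string(s) replaced by NA", sum(bad)))
    x[bad] <- NA_character_
  }
  x
}

# Simulated environment values; BOOM holds the single invalid byte 0xff.
vals <- c(GOOD = "fine", BOOM = rawToChar(as.raw(0xff)))
res <- suppressWarnings(replace_invalid_with_na(vals))
```

The valid entries stay usable and the vector keeps its type, at the cost of losing the invalid content.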

Hum, I'm out of my waters here, but offhand I would be wary about approaches 
that lead to loss of information. Presumably someone will sooner or later 
actually want to deal with the content of an environment variable with invalid 
bytes inside. I.e. it would be preferable to keep the content and mark the 
encoding as something not-multibyte.

In fact this is almost what happens (for me...) if I just add Encoding(x) <- "bytes" for 
the return value of .Internal(Sys.getenv(character(), "")):

Sys.getenv()[["BOOM"]]
[1] "\\xff"
Encoding(Sys.getenv())
  [1] "unknown" "unknown" "bytes"   "unknown" "unknown" "unknown" "unknown"
  [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
...

but I suppose that breaks if I have environment variables that actually _are_ utf8, 
because only plain-ASCII becomes "unknown"? And nchar(Sys.getenv()) also does 
not work.

Yes, that way you would get even valid UTF-8 strings represented as "bytes". But that is not the main problem. It would technically be possible to keep valid strings in the native encoding (typically UTF-8) as "native" and mark only the invalid ones as "bytes", as Ivan also suggested.

But the key problem is that it breaks because of that "type instability": other string functions later will start failing on "bytes", resulting in a mess.
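
A small sketch of that failure mode, using a hand-built string rather than a real environment variable. Once a string is marked as "bytes", counting bytes still works, but counting characters signals an error:

```r
# Build a one-byte invalid string and mark it as "bytes".
x <- rawToChar(as.raw(0xff))
Encoding(x) <- "bytes"

# Byte-level operations still work on "bytes" strings.
n_bytes <- nchar(x, type = "bytes")

# Character-level operations do not: nchar() defaults to type = "chars".
outcome <- tryCatch(
  {
    nchar(x)
    "ok"
  },
  error = function(e) "error"
)
```

Any code downstream of Sys.getenv() that assumes ordinary strings would hit this kind of error as soon as one element is "bytes".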

We could provide a new API (e.g. an argument "useBytes=TRUE") which would provide all variables as "bytes" (all-ASCII would be "native" per how "bytes" works) and let the users decide whether they want to use iconv() to turn some of them into strings, and how to do such conversions (e.g. error, warn, substitute NA, substitute something else). That would allow working with such variables. That would probably be a "clean" solution, at least for POSIX systems, but I doubt anyone would use it. For Windows it would still be questionable (due to the two environment profiles in two different encodings, which may not match).
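
A rough sketch of the user-side conversion policies mentioned above. The "useBytes=TRUE" argument does not exist, so the raw values are simulated here with a plain character vector, and UTF-8 is assumed as the intended encoding:

```r
# Simulated raw values; BOOM contains "a" followed by the invalid byte 0xff.
raw_vals <- c(GOOD = "fine", BOOM = rawToChar(as.raw(c(0x61, 0xff))))

# Policy 1: unconvertible strings become NA (the iconv() default).
strict <- iconv(raw_vals, from = "UTF-8", to = "UTF-8")

# Policy 2: substitute each invalid byte with its hex code, e.g. <ff>.
lenient <- iconv(raw_vals, from = "UTF-8", to = "UTF-8", sub = "byte")
```

Which policy is right depends on the caller, which is the argument for leaving the choice to users rather than building it into Sys.getenv().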

(And of course I agree that the QRSH thing is Just Wrong; people using 0xff as 
a separator between utf8 strings deserve the same fate as those who used comma 
separation between numbers with decimal commas.)

Indeed.

And then I am afraid I have to make my position stronger after reading more about what Windows does. Its API clearly implies that variables are strings, because it automatically converts them between "wide" and "multi-byte". An application can have both of these profiles, in which case the C runtime manages both and they may get out of sync; the documentation explicitly warns about confusion arising because some characters cannot be converted to some encodings.

Also, the Windows approach that environment values are strings is "compatible" with the fact that different applications on Windows may, and often do, use different native encodings (some use UTF-8, as R does, but some use a legacy encoding, e.g. Latin1, or a multi-byte encoding for other languages). Imagine that an application running in the legacy encoding sets an environment variable to a valid non-ASCII string. Then you run R and it tries to read that variable. It works due to encoding conversions. When the strings are valid.... (and yes, when the mapping is 1-1, but that's another matter)

So, at least on Windows, environment variables clearly are strings, not blobs.

Tomas

-pd


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
