On 1/31/23 14:37, peter dalgaard wrote:
On 31 Jan 2023, at 12:51 , Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
On 1/31/23 11:50, Martin Maechler wrote:
<snippage>
hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
but of course you are right that type stability is really a
valid goal....
In general, what about behaving close to "old R" and replace all such
strings by NA_character_ (and typically raising one warning)?
This would keep the result a valid character vector, just with some NA entries.
Specifically for Sys.getenv(), I still think Simon has a very
valid point of "requiring" (of our design) that
Sys.getenv()[["BOOM"]] {double `[[`} should be the same as
Sys.getenv("BOOM")
Also, as a typical R user, I'd definitely want to be able to get all the valid
environment variables, even if there are one or more invalid
ones. ... and similarly in other cases, it may be a cheap
strategy to replace invalid strings ("string" in the sense of
length 1 STRSXP, i.e., in R, a "character" of length 1) by
NA_character_ and keep all valid parts of the character vector
in a valid encoding.
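
A minimal sketch of the "replace by NA_character_ and warn once" idea, for illustration only. The helper name na_if_invalid() is hypothetical and not part of R, and it assumes that "invalid" means "not valid UTF-8" (checked with validUTF8()), which only matches the native encoding in a UTF-8 session. The environment is simulated with a literal vector rather than read from the real process.

```r
# Hypothetical helper (not part of R): replace invalid strings by NA_character_.
# Assumption: "invalid" means "not valid UTF-8", as reported by validUTF8().
na_if_invalid <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  if (any(bad)) {
    # One warning for the whole vector, not one per element.
    warning(sum(bad), " invalid string(s) replaced by NA_character_")
    x[bad] <- NA_character_
  }
  x
}

# Simulated environment: "\xff" is a lone byte 0xff, which is never valid UTF-8.
env <- c(HOME = "/home/user", BOOM = "\xff", LANG = "en_US.UTF-8")
res <- suppressWarnings(na_if_invalid(env))
print(is.na(res))
```

The result is still an ordinary character vector with its names intact, so the valid variables remain usable by any string function.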
In case of specifically getenv(), yes, we could return NA for variables
containing invalid strings, both when obtaining a value for a single variable
and for multiple, partially matching undocumented and unintentional behavior R
had before 4.1, and making getenv(var) and getenv()[[var]] consistent even with
invalid strings. Once we decide on how to deal with invalid strings in
general, we can change this again accordingly, breaking code for people who
depend on these things (but so far I only know about this one case). Perhaps
this would be a logical alternative to the Python approach that would be more
suitable for R (given we have NAs and given that we happened to have that
somewhat similar alternative before). Conceptually it is about the same thing
as omitting the variable in Python: R users would not be able to use such
variables, but if they don't touch them, they could be inherited by child
processes, etc.
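
The consistency requirement discussed above can be checked directly for a well-formed variable. This is only a sketch: DEMO_VAR is a made-up name, and the check assumes the current environment contains no invalid strings, so that Sys.getenv() with no arguments succeeds.

```r
# DEMO_VAR is a made-up variable name used only for this check.
Sys.setenv(DEMO_VAR = "some value")

single   <- Sys.getenv("DEMO_VAR")        # lookup of one variable
from_all <- Sys.getenv()[["DEMO_VAR"]]    # lookup via the full listing

# The design goal is that both routes give the same answer.
consistent <- identical(single, from_all)
print(consistent)

Sys.unsetenv("DEMO_VAR")
```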
<more snippage>
Hum, I'm out of my depth here, but offhand I would be wary about approaches
that lead to loss of information. Presumably someone will sooner or later
actually want to deal with the content of an environment variable with invalid
bytes inside. I.e. it would be preferable to keep the content and mark the
encoding as something not-multibyte.
In fact this is almost what happens (for me...) if I just add Encoding(x) <- "bytes" for
the return value of .Internal(Sys.getenv(character(), "")):
Sys.getenv()[["BOOM"]]
[1] "\\xff"
Encoding(Sys.getenv())
[1] "unknown" "unknown" "bytes" "unknown" "unknown" "unknown" "unknown"
[8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
...
but I suppose that breaks if I have environment variables that actually _are_ utf8,
because only plain-ASCII becomes "unknown"? And nchar(Sys.getenv()) also does
not work.
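
The effect described above can be reproduced without touching the real environment. The sketch below uses a literal vector as a stand-in for the values returned by Sys.getenv(); the three values are illustrative assumptions.

```r
# Stand-in values: plain ASCII, valid non-ASCII (e-acute), and a lone 0xff byte.
x <- c("abc", "\u00e9", "\xff")
Encoding(x) <- "bytes"

# ASCII strings are never marked, so only they stay "unknown";
# the valid non-ASCII string is now "bytes" as well.
enc <- Encoding(x)
print(enc)

# Counting characters is refused for "bytes" strings ...
chars <- try(nchar(x), silent = TRUE)
print(inherits(chars, "try-error"))

# ... while counting bytes still works.
nbytes <- nchar(x, type = "bytes")
print(nbytes)
```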
Yes, that way you would get even valid UTF-8 strings represented as
"bytes". But that is not the main problem. It would technically be
possible to keep valid strings in the native encoding (typically UTF-8)
as "native", but only those invalid as "bytes", as Ivan also suggested.
But the key problem is that it breaks because of that "type
instability": other string functions later will start failing on
"bytes", resulting in a mess.
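
To make the "type instability" concrete, here is a sketch of the selective variant, in which only the invalid elements are marked as "bytes". The helper mark_invalid_as_bytes() is hypothetical, and it assumes UTF-8 is the encoding to validate against.

```r
# Hypothetical helper: mark only the invalid elements as "bytes".
# Assumption: validity is judged as UTF-8 via validUTF8().
mark_invalid_as_bytes <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  Encoding(x[bad]) <- "bytes"
  x
}

y <- mark_invalid_as_bytes(c("abc", "\u00e9", "\xff"))
print(Encoding(y))

# The valid elements behave as ordinary strings ...
ok <- nchar(y[1:2])
print(ok)

# ... but one "bytes" element is enough to make a vectorised call fail.
whole <- try(nchar(y), silent = TRUE)
print(inherits(whole, "try-error"))
```

Whether later code works then depends on the data rather than on the type of the return value, which is the instability in question.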
We could provide a new API (e.g. an argument "useBytes=TRUE") which would
provide all variables as "bytes" (all-ASCII would be "native" per how
"bytes" works) and let the users decide whether they want to use iconv()
to turn some of them into strings, how to do such conversions (e.g.
error, warn, substitute NA, substitute something else). That would allow
working with such variables. That would probably be a "clean" solution,
at least for POSIX systems, but I doubt anyone would use it. For Windows
it would still be questionable (due to the two environment profiles in
two different encodings, which may not match).
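
A sketch of what the user-side conversion step might look like. The "useBytes=TRUE" argument is only a proposal and does not exist, so the raw values are simulated with a literal vector. convert_env() and its on_invalid policies are hypothetical names. The conversion assumes the values are meant to be UTF-8 and relies on iconv() returning NA for unconvertible input, which is the documented behaviour but may vary with the platform's iconv implementation.

```r
# Hypothetical helper: convert raw values to strings under a chosen policy.
convert_env <- function(x, on_invalid = c("na", "warn", "error", "byte")) {
  on_invalid <- match.arg(on_invalid)
  if (on_invalid == "byte") {
    # Substitute each unconvertible byte by its hex code, e.g. "<ff>".
    return(iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte"))
  }
  out <- iconv(x, from = "UTF-8", to = "UTF-8")  # NA where conversion fails
  bad <- is.na(out) & !is.na(x)
  if (any(bad)) {
    msg <- paste("invalid value for:", paste(names(x)[bad], collapse = ", "))
    if (on_invalid == "error") stop(msg)
    if (on_invalid == "warn") warning(msg)
  }
  out
}

# Simulated raw values, standing in for what a bytes-based API might return.
raw_vals <- c(GOOD = "abc", BOOM = "\xff")

as_na  <- convert_env(raw_vals, "na")
as_hex <- convert_env(raw_vals, "byte")
print(is.na(as_na))
print(unname(as_hex))
```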
(And of course I agree that the QRSH thing is Just Wrong; people using 0xff as
a separator between utf8 strings deserve the same fate as those who used comma
separation between numbers with decimal commas.)
Indeed.
And then I am afraid I have to make my position stronger based on
reading more about what Windows does. Its API clearly implies that
variables are strings, because it automatically converts them between
"wide" and "multi-byte". An application can have both of these profiles,
in which case the C runtime manages both and they may get out of sync;
the documentation explicitly warns about the resulting confusion, since
some characters cannot be converted to some encodings.
Also, the Windows approach, in which environment values are strings, is
"compatible" with the fact that different applications on Windows may,
and often do, use different native encodings (some use UTF-8, such as R,
but some use the legacy encoding, e.g. Latin1, or a multi-byte encoding
for other languages). Imagine that an application running in the legacy
encoding
sets an environment variable to a valid non-ASCII string. And then you
run R and it tries to read that variable. It works due to encoding
conversions. When the strings are valid.... (and yes, when the mapping
is 1-1, but that's another matter)
So, at least on Windows, environment variables clearly are strings, not
blobs.
Tomas
-pd
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel