Dear all,

packages for processing text may need information on the charset of the R 
session. In my packages RcppCWB and polmineR, I extract this information from 
the locale using `localeToCharset()`. But when running cross-platform checks 
(GitHub Actions and Docker), I repeatedly encounter unexpected behavior of 
`localeToCharset()`.

As a reproducible example, I suggest using a local Fedora (latest) 
container, started as follows:

docker pull fedora:latest
docker run -it fedora:latest /bin/bash

After installing R (`yum install -y R`) and starting R, `localeToCharset()` 
returns `NA`. However, the locale section of `sessionInfo()` reads as follows:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

If I run R CMD check on an arbitrary package in this environment at this 
stage, I see:
* using session charset: UTF-8

The documentation says, however: “In the C locale the answer will be "ASCII".”
Why not UTF-8 in this case?
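
The comparison can also be sketched without Docker by passing locale strings 
to `localeToCharset()` directly, since its `locale` argument defaults to 
`Sys.getlocale("LC_CTYPE")`. This is only a sketch: apart from the documented 
"ASCII" answer for the plain C locale, the results depend on the platform and 
the R version, so no outputs are stated here.

```r
# Locale strings taken from the examples in this message
locales <- c("C", "C.UTF-8", "en_US.UTF-8")

# Pass each string explicitly instead of relying on the session locale
charsets <- lapply(locales, utils::localeToCharset)
names(charsets) <- locales
str(charsets)

# What the running session itself reports
Sys.getlocale("LC_CTYPE")
l10n_info()[["UTF-8"]]
```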

The `localeToCharset()` function also confuses me when I explicitly redefine 
the locale. In my fresh Fedora Docker container, I need to install 
English-language locales first:
dnf install langpacks-en

After starting R with a redefined locale (`env LC_CTYPE=en_US.UTF-8 R`), the 
output of `localeToCharset()` is:
[1] "UTF-8"     "ISO8859-1"

The “Value” section of the documentation says: “A character vector naming an 
encoding and possibly a fallback single-encoding, NA if unknown.”  But I do not 
understand why ISO8859-1 would be a fallback option here.

I do not know whether this is just a matter of documentation. My intuition is 
that `localeToCharset()` should work differently. At the moment, I have to rely 
on a few workarounds to cope with behavior I do not understand. (Or is there a 
better function to detect the encoding of the R session?)
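
One possible shape for such a workaround, as a minimal sketch: 
`session_charset()` is a hypothetical helper name, not a function from RcppCWB 
or polmineR, and it assumes that the `UTF-8` and `Latin-1` flags returned by 
`l10n_info()` are a better first source than parsing the locale name.

```r
# Hypothetical helper (an assumption for illustration, not an existing API)
session_charset <- function() {
  info <- l10n_info()

  # Prefer what the running session reports about itself
  if (isTRUE(info[["UTF-8"]])) return("UTF-8")
  if (isTRUE(info[["Latin-1"]])) return("ISO8859-1")

  # Otherwise use the first guess of localeToCharset(), which may be NA
  as.character(utils::localeToCharset()[1L])
}

session_charset()
```

Whether an `NA` result should be mapped to "ASCII" or left as `NA` is a design 
choice the sketch leaves open.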

Part of my analysis of the code of `localeToCharset()` is that it targets 
special scenarios on Windows and macOS, but not on Linux.

Kind regards
Andreas

--
Prof. Dr. Andreas Blaette
Professor of Public Policy and Regional Politics
University of Duisburg-Essen




______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
