Hi Vladimir,

Thanks for your report. This was indeed a bug; it is now fixed in R-devel and will appear in 3.5.0.

Apart from the bug, having source files in UTF-8 and reading them into R on Windows is perfectly fine; you just need to specify that they are in UTF-8. You also need to make sure R is running in a Russian locale (CP1251) if that is not the default. On my system, this works fine:

Sys.setlocale(locale="Russian")
source("russian_utf8.R", encoding="UTF-8")
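
A minimal sketch of how one might check both conditions before sourcing. The temporary file and its contents are made up for illustration, and the tryCatch() wrapper is only there so that a failure is reported instead of stopping the session; it is not part of the fix.

```r
# Hypothetical script written to a temporary file, for illustration only.
# The text is built from \u escapes so the example does not depend on the
# encoding of the file it is pasted into.
path <- tempfile(fileext = ".R")
script <- enc2utf8(c(
  'w1 <- "\u041e\u0441\u0435\u043d\u044c"',
  'w2 <- "\u0442\u0440\u044f\u0441\u0438\u043d\u0430"'
))
con <- file(path, open = "wb")
writeLines(script, con, useBytes = TRUE)  # write the UTF-8 bytes unchanged
close(con)

# Check 1: the file really is valid UTF-8.
lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
is_utf8 <- all(validUTF8(lines))

# Check 2: the session's native encoding. On Windows, info$codepage should
# be 1251 in a Russian locale; on other platforms the element may be absent.
info <- l10n_info()

# Source with the encoding stated explicitly; capture an error message
# rather than aborting.
env <- new.env()
result <- tryCatch(
  source(path, local = env, encoding = "UTF-8"),
  error = function(e) conditionMessage(e)
)
```

If is_utf8 is FALSE, the file is not in the encoding being declared, and changing the locale will not help.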

Best
Tomas


On 08/28/2017 11:27 AM, Владимир Панфилов wrote:
Hello,

I do not have an account on R Bugzilla, so I will post my bug report here.
I want to report a very old bug in the base R *source()* function. It relates
to sourcing R scripts in UTF-8 encoding on Windows machines. For some
reason, if a UTF-8 script contains the Cyrillic letter *"я"*, script
execution is interrupted directly at this letter (by the way, the same scripts
source fine when they are encoded in the system's CP1251 encoding).

Let's consider the following script that prints random Russian words:



print("Осень")
print("Ёжик")
print("трясина")
print("тест")

When this script is sourced, we get an INCOMPLETE_STRING error:





source('D:/R code/test_cyr_letter.R', encoding = 'UTF-8', echo=TRUE)
Error in source("D:/R code/test_cyr_letter.R", encoding = "UTF-8", echo = TRUE) :
  D:/R code/test_cyr_letter.R:3:7: unexpected INCOMPLETE_STRING
2: print("Ёжик")
3: print("тр
         ^

Note that this bug is not triggered when the same file is executed using
*eval(parse(...))*:




> eval(parse('D:/R code/test_cyr_letter.R', encoding="UTF-8"))
[1] "Осень"
[1] "Ёжик"
[1] "трясина"
[1] "тест"

I did some research and noticed that the *source* and *parse* functions have
similar code for reading files. After analyzing the code of the *source()*
function, I found that commenting out one line fixes this bug, and
the overridden function works fine. See this part of the *source()* function
code:

...
        filename <- file
        file <- file(filename, "r")
        # on.exit(close(file))  #### COMMENT THIS LINE ####
        if (isTRUE(keep.source)) {
          lines <- scan(file, what="character", encoding = encoding, sep = "\n")
          on.exit()
          close(file)
          srcfile <- srcfilecopy(filename, lines, file.mtime(filename)[1],
                                 isFile = TRUE)
        }
...


I do not fully understand this odd behaviour, so I ask the R Core
developers for help fixing this annoying bug, which prevents using Unicode
scripts containing Cyrillic on Windows.
Maybe that part of the *source()* function should read files the way the
*parse()* function does?

*Session and encoding info:*

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251    LC_MONETARY=Russian_Russia.1251
[4] LC_NUMERIC=C                    LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1


l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] FALSE
$codepage
[1] 1251
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
