I happened to come across a commit that adds iconv() support for the C99 \u escapes, which might or might not be a coincidence:
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07

In any case, this is great, and it is very useful to have this working
cross-platform. Thank you!

Would it make sense to generate braced 4-digit \uxxxx sequences, to
make sure that they don't mix with the surrounding text? I.e.
\u{xxxx}? (Plus update the 6 to 8 twice.)
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R746-R747

Also, it seems that we need a capital \U for the 8-digit sequences here:
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R753

Thank you again,
Gabor

On Mon, Feb 21, 2022 at 2:17 PM Brodie Gaslam <brodie.gas...@yahoo.com> wrote:
>
> I'm not R-core, but happen to have run into this issue.
>
> I think this makes sense conceptually, and have had the same thought
> myself. One implementation challenge is that the parser has a special
> branch for Unicode escape strings (e.g. "G\u00e1bor") that limits such
> input to 10K wide characters, so the parser would need to be modified in
> order to make this a general solution:
>
>  > parse(text=sprintf('"%s"', strrep("G\\u00e1bor", 2000)))
>  Error in parse(text = sprintf("\"%s\"", strrep("G\\u00e1bor", 2000))) :
>    string at line 1 containing Unicode escapes not in this locale
>  is too long (max 10000 chars)
>
> Such strings are rare so maybe an interim solution is just to allow it
> for deparsing of shorter strings. The parser modification itself would
> also have the benefit of speeding up parsing of strings without Unicode
> escapes.
>
> Best,
>
> B.
>
>
> On 2/21/22 5:33 AM, Gábor Csárdi wrote:
> > I am wondering if it would make sense to produce \u escaped strings in
> > deparse() for UTF-8 input.
> > Currently we have (in R-devel):
> >
> > x <- "G\u00e1bor"
> > Sys.setlocale("LC_ALL", "C")
> > #> [1] "C/C/C/C/C/en_US.UTF-8"
> >
> > deparse(x)
> > #> [1] "\"G<U+00E1>bor\""
> >
> > charToRaw(deparse(x))
> > #> [1] 22 47 3c 55 2b 30 30 45 31 3e 62 6f 72 22
> >
> > Is there a reason why this is preferable instead of returning
> >
> > "\"G\\u00e1bor\""
> >
> > i.e.
> >
> > charToRaw("\"G\\u00e1bor\"")
> > #> [1] 22 47 5c 75 30 30 65 31 62 6f 72 22
> >
> > Returning the \u escaped form would make deparse() the inverse of
> > parse(), at least in this respect.
> >
> > Thank you,
> > Gabor
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
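
To illustrate the braced form suggested at the top of this message, here is a minimal R-level sketch. It is not the code from the commit: escape_braced() is a hypothetical helper written only for illustration, and it assumes a single, non-NA string. Non-ASCII code points up to U+FFFF become \u{xxxx}, and anything above that becomes \U{xxxxxxxx} with a capital \U. It does not escape quotes or backslashes, so it is not a full deparser.

```r
# Hypothetical helper (not part of R or of the commit discussed above):
# rewrite non-ASCII characters as braced Unicode escapes.
escape_braced <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  cps <- utf8ToInt(enc2utf8(x))        # integer Unicode code points
  pieces <- vapply(cps, function(cp) {
    if (cp < 0x80) {
      intToUtf8(cp)                    # ASCII passes through unchanged
    } else if (cp <= 0xFFFF) {
      sprintf("\\u{%04x}", cp)         # 4 hex digits, lower-case \u
    } else {
      sprintf("\\U{%08x}", cp)         # 8 hex digits, capital \U
    }
  }, character(1))
  paste(pieces, collapse = "")
}

x <- "G\u00e1bor"
escape_braced(x)
#> [1] "G\\u{00e1}bor"
```

The escaped text can be fed back to the parser, e.g. parse(text = sprintf('"%s"', escape_braced(x))), which is the round trip the original question was after.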