On Wed, Jan 17, 2007 at 11:56:15PM -0800, Ross Boylan wrote:
> An earlier thread (in 10/2006) discussed encoding issues in the
> context of R data and the desire to represent accented characters.
>
> It matters in another setting: the output generated by R and the
> seemingly order character "'" (single quote).  In particular, R CMD
            ^^^ should be "ordinary"
> check runs test code and compares the generated output to a saved file
> of expected output.  This does not work reliably across encoding
> schemes.  This is unfortunate, since it seems the "expected output"
> files will necessarily be wrong for someone.
>
> The problem for me was triggered by the single-quote character "'".
> On my older systems, this is encoded by 0x27, a perfectly fine ASCII
> character.  That is on a Debian GNU/Linux system with LANG=en_US.  On
> a newer system I have LANG=en_US.UTF-8.  I don't recall whether
> this was a deliberate choice on my part, or simply reflects changing
> defaults for the installer.  (Note the earlier thread referred to the
> Debian-derived Ubuntu systems as having switched to UTF-8).  Under
> UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
> (which seems odd; I thought the point of UTF-8 was that ASCII was a
> legitimate subset).
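
The two byte values mentioned above can be checked directly. The following is only a minimal sketch in Python (the helper name utf8_hex is made up for illustration, and nothing here is taken from R or R CMD check). It suggests that the ASCII apostrophe keeps its single-byte value under UTF-8, and that the 3-byte sequence belongs to a different character, U+2018.

```python
# Compare the ASCII apostrophe with the Unicode left single quotation mark.
ascii_quote = "'"       # U+0027 APOSTROPHE
left_quote = "\u2018"   # U+2018 LEFT SINGLE QUOTATION MARK


def utf8_hex(text):
    """Hypothetical helper: UTF-8 encoding of text as an uppercase hex string."""
    return text.encode("utf-8").hex().upper()


print(utf8_hex(ascii_quote))  # 27
print(utf8_hex(left_quote))   # E28098

# ASCII characters keep their single-byte values under UTF-8 ...
assert ascii_quote.encode("utf-8") == ascii_quote.encode("ascii")

# ... whereas U+2018 cannot be encoded in ASCII at all.
try:
    left_quote.encode("ascii")
except UnicodeEncodeError:
    print("U+2018 is not representable in ASCII")
```
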

Apparently quoting, particularly single quotes, is a can of worms:
http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
When Unicode is available (which would be the case with UTF-8),
particular non-ASCII characters are recommended for single quoting.

The 3-byte sequence is the UTF-8 encoding of U+2018, the recommended
left single quote mark.  So ASCII is still a subset of UTF-8: 0x27 is
unchanged, and the output simply contains a different character.  See
http://en.wikipedia.org/wiki/UTF-8 on UTF-8 encoding.

This is more than I or, probably, you ever wanted to know about this
issue!

Ross

>
> The coefficient printing methods in the stats package use the
> single-quote in the key explaining significance levels:
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> I suppose one possible work-around for R CMD check would be to set the
> encoding to some standard value before it runs tests, but that has
> some drawbacks.  It doesn't work for packages needing a different
> encoding (but perhaps the package could specify an encoding to use by
> default?)(*), It will leave the output files looking weird on systems
> with a different encoding.  It will get messed up if one generates the
> files under the wrong encoding.
>
> And none of this addresses stuff beyond the context of output file
> comparison in R CMD check.
>
> Any thoughts?
>
> Ross Boylan
>
>
> * From the R Extensions document, discussing the DESCRIPTION file:
> If the `DESCRIPTION' file is not entirely in ASCII it should contain
> an `Encoding' field specifying an encoding.  This is currently used as
> the encoding of the `DESCRIPTION' file itself, and may in the future be
> taken as the encoding for other documentation in the package.  Only
> encoding names `latin1', `latin2' and `UTF-8' are known to be portable.
>
> I would not expect that the test output files be considered
> "documentation," but I suppose that's subject to interpretation.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel