Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
No, I don't think anyone is working on this.

There's a fairly simple workaround for the UTF-16 and UTF-32 iconv issues: don't attempt to produce character vectors, produce raw vectors instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors can contain embedded nulls. Character vectors can't, because internally, R is using 8 bit C strings, and the nulls are string terminators.

I don't know how difficult it would be to fix the write.table problems.

Duncan Murdoch

On 29/04/2017 7:53 PM, Jack Kelley wrote:
> "R version 3.4.0 (2017-04-21)" on "x86_64-w64-mingw32" platform
>
> I am using CSVs and other text tables, and text in general (including regular expressions), on Windows 10. For me, that means dealing with Windows-1252 and UTF-8 encoding, with UTF-16 and UTF-32 as helpful curiosities.
>
> Something as simple as iconv ("\n", to = "UTF-16") causes an error, due to an embedded nul.
>
> Then there is write.csv (or write.table) with its fileEncoding parameter: not working correctly for UTF-16 and UTF-32.
>
> Of course, developers are aware of this, for example …
>
> [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
> https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html
>
> iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
> http://r.789695.n4.nabble.com/iconv-to-UTF-16-encoding-produces-error-due-to-embedded-nulls-write-table-with-fileEncoding-param-td4717481.html
>
> Focussing on write.csv and UTF-16LE and UTF-16BE, it seems that a nul character is omitted in each pair.
>
> TEST SCRIPT
>
> remove (list = objects())
> print (sessionInfo())
> cat ("-\n\n")
>
> LE <- data.frame (
>   want = c ("0d,00", "0a,00"),
>   got  = c ("0d   ", "0a,00")
> )
> BE <- data.frame (
>   want = c ("00,0d", "00,0a"),
>   got  = c ("00,0d", "   0a")
> )
>
> write.csv (LE, "R_LE.csv", fileEncoding = "UTF-16LE", row.names = FALSE)
> write.csv (BE, "R_BE.csv", fileEncoding = "UTF-16BE", row.names = FALSE)
>
> print (readBin ("R_LE.csv", "raw", 1000))
> print (LE)
> cat ("\n")
>
> print (readBin ("R_BE.csv", "raw", 1000))
> print (BE)
> cat ("\n")
>
> try (iconv ("\n", to = "UTF-8"))
> try (iconv ("\n", to = "UTF-16LE"))
> try (iconv ("\n", to = "UTF-16BE"))
> try (iconv ("\n", to = "UTF-16"))
> try (iconv ("\n", to = "UTF-32LE"))
> try (iconv ("\n", to = "UTF-32BE"))
> try (iconv ("\n", to = "UTF-32"))
>
> TEST SCRIPT OUTPUT
>
> source ("bug_encoding.R")
> R version 3.4.0 (2017-04-21)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
> [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
> [5] LC_TIME=English_Australia.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.0
> -
>
>  [1] 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00 0d
> [26] 0a 00 22 00 30 00 64 00 2c 00 30 00 30 00 22 00 2c 00 22 00 30 00 64 00 20
> [51] 00 20 00 20 00 22 00 0d 0a 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 2c
> [76] 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 0d 0a 00
>    want   got
> 1 0d,00 0d
> 2 0a,00 0a,00
>
>  [1] 00 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00
> [26] 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 64 00 22 00 2c 00 22 00 30 00 30 00
> [51] 2c 00 30 00 64 00 22 00 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 61 00 22 00
> [76] 2c 00 22 00 20 00 20 00 20 00 30 00 61 00 22 00 0d 0a
>    want   got
> 1 00,0d 00,0d
> 2 00,0a    0a
>
> Error in iconv("\n", to = "UTF-16LE") : embedded nul in string: '\n\0'
> Error in iconv("\n", to = "UTF-16BE") : embedded nul in string: '\0\n'
> Error in iconv("\n", to = "UTF-16") : embedded nul in string: 'þÿ\0\n'
> Error in iconv("\n", to = "UTF-32LE") : embedded nul in string: '\n\0\0\0'
> Error in iconv("\n", to = "UTF-32BE") : embedded nul in string: '\0\0\0\n'
> Error in iconv("\n", to = "UTF-32") : embedded nul in string: '\0\0þÿ\0\0\0\n'
>
> Cheers -- Jack Kelley

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
> No, I don't think anyone is working on this.
>
> There's a fairly simple workaround for the UTF-16 and UTF-32 iconv issues: don't attempt to produce character vectors, produce raw vectors instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors can contain embedded nulls. Character vectors can't, because internally, R is using 8 bit C strings, and the nulls are string terminators.
>
> I don't know how difficult it would be to fix the write.table problems.

I've now taken a look, and it appears as if it's not too hard. I'll see if I can work out a patch that I trust.

Duncan Murdoch

> On 29/04/2017 7:53 PM, Jack Kelley wrote:
>> "R version 3.4.0 (2017-04-21)" on "x86_64-w64-mingw32" platform
>> ...
> [rest omitted]
Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
Thanks for looking into this.

A few notes regarding all the UTF encodings on Windows 10 ...

The default eol for write.csv (via write.table) is "\n" and always gives as.raw (c (0x0d, 0x0a)), that is, CR+LF as adjacent bytes. This is fine for UTF-8 but wrong for UTF-16 and UTF-32.

EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are missing in the final CR+LF):

df <- data.frame (x = 1:2, y = 3:4)

$`UTF-32LE`$default.eol$raw
 [1] 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00 00 00 22
[26] 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a 00 00 00
[51] 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a 00 00 00

$`UTF-32BE`$default.eol$raw
 [1] 00 00 00 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00
[26] 00 00 22 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a
[51] 00 00 00 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a

(Nevertheless, Microsoft Excel 2013 tolerates these CSVs!)

One trick/solution is to use eol = "\r" (that is, CR only).

Regards -- Jack Kelley

remove (list = objects())
print (sessionInfo())
cat ("##\n\n")

ENCODING <- c (
  "UTF-8",
  "UTF-16LE", "UTF-16BE", "UTF-16",
  "UTF-32LE", "UTF-32BE", "UTF-32"
)

df <- data.frame (x = 1:2, y = 3:4)

csv <- structure (lapply (ENCODING, function (encoding) {
  csv <- sprintf ("df_%s.csv", encoding)
  write.csv (df, csv, fileEncoding = encoding, row.names = FALSE)
  list (default.eol = list (
    csv = csv,
    raw = readBin (csv, "raw", 1000))
  )
}), .Names = ENCODING)

EOL <- c (LF = "\n", CR = "\r", "CR+LF" = "\r\n")

CSV <- structure (lapply (ENCODING, function (encoding) {
  structure (
    lapply (names (EOL), function (EOL.name) {
      csv <- sprintf ("df_%s_eol=%s.csv", encoding, EOL.name)
      write.csv (
        df, csv, fileEncoding = encoding, row.names = FALSE,
        eol = EOL [EOL.name]
      )
      list (csv = csv, raw = readBin (csv, "raw", 1000))
    }),
    .Names = names (EOL))
}), .Names = ENCODING)

print (csv)
print (CSV)

-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.dun...@gmail.com]
Sent: Tuesday, 2 May 2017 04:22
To: Jack Kelley; r-devel@r-project.org
Subject: Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
> No, I don't think anyone is working on this.
>
> There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
> issues: don't attempt to produce character vectors, produce raw vectors
> instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
> can contain embedded nulls. Character vectors can't, because
> internally, R is using 8 bit C strings, and the nulls are string
> terminators.
>
> I don't know how difficult it would be to fix the write.table problems.

I've now taken a look, and it appears as if it's not too hard. I'll see if I can work out a patch that I trust.

Duncan Murdoch

>
> Duncan Murdoch
>
> On 29/04/2017 7:53 PM, Jack Kelley wrote:
>> "R version 3.4.0 (2017-04-21)" on "x86_64-w64-mingw32" platform
>> ...
[rest omitted]
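As a point of comparison for the dumps in the message above, the bytes that a CR+LF pair should occupy in each encoding can be obtained with iconv() and toRaw = TRUE. This is only a sketch: the variable names are made up for illustration, and the BOM-carrying names "UTF-16" and "UTF-32" are left out because their byte order depends on the iconv implementation.

```r
# Encode CR+LF in each encoding. With toRaw = TRUE, iconv() returns a list
# holding one raw vector per input string, so embedded nul bytes are allowed.
encodings <- c("UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE")

eol_bytes <- lapply(encodings, function(enc) {
  iconv("\r\n", from = "UTF-8", to = enc, toRaw = TRUE)[[1]]
})
names(eol_bytes) <- encodings

print(eol_bytes)
```

In the wide encodings each of CR and LF takes a full code unit (2 or 4 bytes), which is why the adjacent 0d 0a in the dumps is not a valid line ending there.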
Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
Correction to my previous post: Not just the final CR+LF...

Change

  EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are missing in the final CR+LF):

to

  EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are missing in *each* CR+LF):

-- Jack Kelley
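The corrected count can be checked with a little arithmetic on the UTF-32 dumps from the previous post. The sketch below assumes the file is meant to hold the three CSV lines for df, each ended by CR+LF, and takes the observed size (67 bytes) from the dump itself, whose last row starts at index 51 and holds 17 bytes. The variable names are illustrative only.

```r
# The three CSV lines that write.csv produces for df (row.names = FALSE).
csv_lines <- c('"x","y"', "1,3", "2,4")

# Characters in the file: the line contents plus CR and LF for every line.
n_chars <- sum(nchar(csv_lines)) + 2 * length(csv_lines)

# UTF-32 uses 4 bytes per character.
expected_bytes <- 4 * n_chars

# Size of each UTF-32 dump in the previous post.
observed_bytes <- 67

missing_bytes   <- expected_bytes - observed_bytes
missing_per_eol <- missing_bytes / length(csv_lines)

print(c(expected = expected_bytes, observed = observed_bytes,
        missing = missing_bytes, per_eol = missing_per_eol))
```

The shortfall divides evenly over the three line endings, which agrees with the correction: each CR+LF occupies 5 bytes instead of 8.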
Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
On 01/05/2017 8:49 PM, Jack Kelley wrote:
> Thanks for looking into this.
>
> A few notes regarding all the UTF encodings on Windows 10 ...

This all stems from the ancient bad decision by Microsoft to translate LF characters to CR LF when writing text files. R passes 0A or 0A 00 or 0A 00 00 00 to the output routine (part of the C run-time), and it needs to figure out how many characters there are in those bytes in order to add the appropriate CR with the right width. The default is 8 bit, so you get 0D 0A in current versions of R, regardless of the encoding.

There are ways to declare UTF-16LE (see https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx, or Google "Windows fopen" if that moves), but no other wide encoding. That's what I'm putting in place if you ask for UTF-16LE or UCS-2LE. So far I'm not planning to handle UTF-16BE or UTF-32, because doing those would mean R would have to handle the translation of LF itself, and I'm too lazy to do that.

So far this is working for writes, but not reads. I still have to track down what's going wrong there.

Duncan Murdoch

> The default eol for write.csv (via write.table) is "\n" and always gives as.raw (c (0x0d, 0x0a)), that is, as adjacent bytes. This is fine for UTF-8 but wrong for UTF-16 and UTF-32.
> ...
> [rest omitted]
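The effect described in the message above can be imitated in a few lines of R. This is a simulation, not the C run-time's actual code: the helper name insert_cr_8bit is hypothetical, and it simply puts a single 0D byte in front of every 0A byte, which is what an 8-bit LF to CR LF translation amounts to.

```r
# Hypothetical helper: mimic an 8-bit text-mode translation by inserting one
# CR byte (0x0d) in front of every LF byte (0x0a) in a raw vector.
insert_cr_8bit <- function(bytes) {
  lf <- as.raw(0x0a)
  cr <- as.raw(0x0d)
  pieces <- lapply(bytes, function(b) if (b == lf) c(cr, lf) else b)
  as.raw(unlist(pieces))
}

# "a" followed by LF, encoded as raw bytes in two wide encodings.
le <- iconv("a\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
be <- iconv("a\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]

print(insert_cr_8bit(le))
print(insert_cr_8bit(be))
```

The results, 61 00 0d 0a 00 and 00 61 00 0d 0a, show the same 0d 0a pattern as the write.csv dumps earlier in the thread, with the CR one byte wide rather than a full UTF-16 code unit.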
Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
Now fixed in R-devel revision 72650.

Duncan Murdoch

On 02/05/2017 4:11 AM, Duncan Murdoch wrote:
> On 01/05/2017 8:49 PM, Jack Kelley wrote:
>> Thanks for looking into this.
>>
>> A few notes regarding all the UTF encodings on Windows 10 ...
>
> This all stems from the ancient bad decision by Microsoft to translate LF characters to CR LF when writing text files.
> ...
> [rest omitted]
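For the encodings that an earlier message in this thread says are not planned to be handled (UTF-16BE and UTF-32), one possible workaround is to combine the toRaw suggestion with a binary write, so that no text-mode translation is involved. The sketch below is an assumption-laden illustration, not part of R or of the fix: the function name write_csv_raw and its arguments are hypothetical, no BOM is written, and characters outside the native encoding may not survive capture.output() on some platforms.

```r
# Hypothetical helper: render the CSV in memory, encode it with iconv(toRaw = TRUE),
# and write the bytes unchanged with writeBin().
write_csv_raw <- function(df, path, encoding = "UTF-16BE", eol = "\r\n") {
  # One character string per CSV line, without line terminators.
  csv_lines <- utils::capture.output(utils::write.csv(df, row.names = FALSE))

  # Join the lines with the requested terminator, including a final one.
  text <- paste0(paste(enc2utf8(csv_lines), collapse = eol), eol)

  # toRaw = TRUE gives a raw vector, so the nul bytes of wide encodings are kept.
  bytes <- iconv(text, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
  stopifnot(is.raw(bytes))

  # writeBin() opens the named file in binary mode for the duration of the call.
  writeBin(bytes, path)
  invisible(bytes)
}

df <- data.frame(x = 1:2, y = 3:4)
path <- tempfile(fileext = ".csv")
write_csv_raw(df, path, encoding = "UTF-16BE")
print(readBin(path, "raw", 1000))
```

Because the line terminator is encoded together with the rest of the text, CR and LF each get a full code unit in the output.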