Re: [R] Non-ASCII characters in R on Windows

2014-01-19 Thread Duncan Murdoch

On 13-09-17 9:01 AM, Duncan Murdoch wrote:

On 13-09-17 8:15 AM, Milan Bouchet-Valat wrote:

On Monday, 16 September 2013 at 20:04 +0400, Maxim Linchits wrote:

Here is that old post:
http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html

A taste: "Again, the issue is that opening this UTF-8 encoded file
under R 2.13.0 yields an error, but opening it under R 2.12.2 works
without any issues. (...)"

I have tried with R 2.12.2 both 32 and 64 bit on Windows Server 2008
with the French (CP1252) locale, and I still experience an error with
the test case I provided in previous messages. So it does not sound like
it is the same issue.



I can reproduce the error with a file sent to me by Maxim.  From a quick
look, I suspect that changes will be needed to read.table to handle
this, and they'll be large enough that they won't make it into 3.0.2,
but hopefully will go into R-patched after the release.



This took a lot longer than expected, but some changes are now in 
R-devel (as of r64831).  Files that can't be displayed in the local 
encoding are not necessarily displayed well, but they should be stored 
properly.  Please let me know of any problems.
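Duncan's distinction between being displayed well and being stored properly can be checked generically (a sketch, not tied to the r64831 change itself) by comparing code points rather than printed output:

```r
# Display may fall back to <U+xxxx> escapes in locales that cannot render
# a character, but the stored value can still be verified via its code point.
x <- "\u0531"                    # Armenian capital letter Ayb
utf8ToInt(x) == 0x531            # TRUE: the stored code point is intact
identical(intToUtf8(0x531), x)   # TRUE: round-trips through the integer
```

If both comparisons hold, the string survived intact regardless of how the console renders it.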


I think the display issues can be improved; I'm not sure they can be fully solved.

Duncan Murdoch

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Non-ASCII characters in R on Windows

2013-09-19 Thread Duncan Murdoch

On 13-09-19 5:06 AM, Maxim Linchits wrote:

Have any of the thread participants sent a bug report to R? If not,
let me know if you intend to do so. Otherwise, I'll send a report
myself.


There's no bug, as far as I know.  The issue is that various functions 
(by design) convert strings to the local encoding, and in the example 
you were trying, the local encoding can't represent all the characters, 
so they are shown using the hex codes, and things get messed up.


I'm currently looking into changing the design, so that there is more 
use of UTF-8 internally.  This is likely to have side effects, which 
need to be investigated carefully.
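The conversion Duncan describes can be sketched with R's encoding helpers (the exact behaviour depends on the platform locale, so this is illustrative rather than a reproduction of the Windows symptom):

```r
# Strings entered with \u escapes are marked as UTF-8; enc2native()
# converts to the locale's native encoding, which is where information is
# lost when the locale (e.g. CP1252) cannot represent the character and
# R substitutes <U+xxxx> hex escapes.
x <- "\u0531"        # Armenian capital letter Ayb
Encoding(x)          # "UTF-8": the string carries its encoding mark
y <- enc2native(x)   # lossy in locales that cannot represent U+0531
```

In a UTF-8 locale the conversion is a no-op, which is why the problem only shows up on Windows code-page locales.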


Duncan Murdoch





Re: [R] Non-ASCII characters in R on Windows

2013-09-17 Thread Milan Bouchet-Valat
On Monday, 16 September 2013 at 20:04 +0400, Maxim Linchits wrote:
 Here is that old post:
 http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
 
 A taste: "Again, the issue is that opening this UTF-8 encoded file
 under R 2.13.0 yields an error, but opening it under R 2.12.2 works
 without any issues. (...)"
I have tried with R 2.12.2 both 32 and 64 bit on Windows Server 2008
with the French (CP1252) locale, and I still experience an error with
the test case I provided in previous messages. So it does not sound like
it is the same issue.


Regards


Re: [R] Non-ASCII characters in R on Windows

2013-09-17 Thread Duncan Murdoch




I can reproduce the error with a file sent to me by Maxim.  From a quick 
look, I suspect that changes will be needed to read.table to handle 
this, and they'll be large enough that they won't make it into 3.0.2, 
but hopefully will go into R-patched after the release.


Duncan Murdoch



Re: [R] Non-ASCII characters in R on Windows

2013-09-16 Thread Milan Bouchet-Valat
On Friday, 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
 This is a condensed version of the same question on stackexchange here:
 http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
 If you've already stumbled upon it feel free to ignore.
 
 My problem is that R on US Windows does not read *any* text file that
 contains *any* foreign characters. It simply reads the first consecutive n
 ASCII characters and then throws a warning once it reaches a foreign
 character:
 
 > test <- read.table("test.txt", sep=";", dec=",", quote="",
 fileEncoding="UTF-8")
 Warning messages:
 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
 = "UTF-8") :
   invalid input found on input connection 'test.txt'
 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
 = "UTF-8") :
   incomplete final line found by readTableHeader on 'test.txt'
 > print(test)
        V1
 1 english
 
 > Sys.getlocale()
 [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
 States.1252;LC_MONETARY=English_United
 States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
 
 
 It is important to note that R on Linux will read UTF-8 as well as
 exotic character sets without a problem. I've tried it with the exact same
 files (one was UTF-8 and another was OEM866 Cyrillic).
 
 If I do not include the fileEncoding parameter, read.table will read the
 whole CSV file. But naturally it will read it wrong because it does not
 know the encoding. So whenever I try to specify the fileEncoding, R will
 throw the warnings and stop once it reaches a foreign character. It's the
 same story with all international character encodings.
 Other users on stackexchange have reported exactly the same issue.
 
 
 Is anyone here who is on a US version of Windows able to import files with
 foreign characters? Please let me know.
A reproducible example would have helped, as requested by the posting
guide.

Though I am also experiencing the same problem after saving the data
below to a CSV file encoded in UTF-8 (you can do this even using
Notepad):
Ա,Բ
1,10
2,20

This is on a Windows 7 box using a French locale, but the same code page
1252 as yours. What is interesting is that reading the file using
readLines(file("myFile.csv", encoding="UTF-8"))
gives no invalid characters. So there must be a bug in read.table().


But I must note I do not experience issues with French accented
characters like é (\Ue9). On the contrary, reading Armenian
characters like Ա (\U531) gives weird results: the character appears
as <U+0531> instead of Ա.

Self-contained example, writing the file and reading it back from R:
tmpfile <- tempfile()
writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
readLines(file(tmpfile, encoding="UTF-8"))
# [1] "<U+0531>"

The same phenomenon happens when creating a data frame from this
character (as noted on StackExchange):
data.frame("\U531")

So my conclusion is that maybe Windows does not really support Unicode
characters that are not relevant for your current locale. And that may
have created bugs in the way R handles them in read.table(). R
developers can probably tell us more about it.
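Building on Milan's observation that readLines() with an encoding-declared connection works where read.table(fileEncoding=) fails, a possible workaround (a sketch, not an official fix) is to read the lines first and then parse them through a text connection, with check.names = FALSE to sidestep the make.names() mangling of headers:

```r
# Hypothetical workaround: read the bytes with a declared encoding, then
# let read.table() parse the already-decoded text.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "w", encoding = "UTF-8")
writeLines(c("\u0531,\u0532", "1,10", "2,20"), con)  # Milan's test data
close(con)

lines <- readLines(file(tmp, encoding = "UTF-8"))
dat <- read.table(textConnection(lines), sep = ",", header = TRUE,
                  check.names = FALSE)  # keep header names untouched
dim(dat)  # 2 rows, 2 columns
```

Whether the header characters display correctly still depends on the locale, but the parsing itself no longer aborts on the first non-representable character.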


Regards



Re: [R] Non-ASCII characters in R on Windows

2013-09-16 Thread Milan Bouchet-Valat
After some more investigation, one part of the problem can be traced
back to scan() (with myFile.csv filled as described above):
scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

Equivalent, but nonsensical to me:
scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
# Read 0 items
# character(0)
# Warning message:
# In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
#   invalid input found on input connection 'myFile.csv'


So there seems to be one part of the issue in scan(), which for some
reason does not work when passed fileEncoding="UTF-8"; and another part
in read.table(), which transforms Ա (\U531) into "X.U.0531.",
probably via make.names(), since:
make.names("\U531")
# [1] "X.U.0531."
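The make.names() step is what read.table() applies to column headers unless check.names = FALSE is passed. Its ASCII behaviour is the same on every platform, which makes the mangling easy to demonstrate even off Windows (the non-ASCII result shown above is locale-dependent):

```r
# make.names() coerces strings into syntactically valid R names: invalid
# characters become dots, and names starting with a digit get an "X" prefix.
make.names(c("a b", "1x"))   # "a.b" "X1x"
# read.table(..., check.names = FALSE) skips this step for column headers.
```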


Does this make sense to R-core members?


Regards



Re: [R] Non-ASCII characters in R on Windows

2013-09-16 Thread Ista Zahn
UTF-8 on windows is a huge pain, this bites me often. Usually I give
up and do the analysis on a Linux server. In previous struggles with
this I've found this blog post enlightening:
https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

Best,
Ista


Re: [R] Non-ASCII characters in R on Windows

2013-09-16 Thread Ista Zahn
Hi Duncan,

I've put an example file online at
https://docs.google.com/file/d/0B73Ve8vxnjR6QnRESXBQTHRUME0/edit?usp=sharing,
with a screenshot showing the expected contents of the file at
https://docs.google.com/file/d/0B73Ve8vxnjR6b1ZSQmtsRXdadVU/edit?usp=sharing

Hopefully you'll find this easy and the rest of us can feel dumb for
not having figured it out...

Thanks,
Ista

On Mon, Sep 16, 2013 at 1:39 PM, Duncan Murdoch
murdoch.dun...@gmail.com wrote:
 On 16/09/2013 12:04 PM, Maxim Linchits wrote:

 Here is that old post:

 http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html


 In that post, you'll see I asked for a sample file.  I never received any
 reply; presumably some spam filter didn't like what Alexander sent me, and
 Nabble doesn't archive any attachment.

 Similarly, the Stackoverflow thread contains no sample data.

 Could someone who is having this problem please put a small sample online
 for download?  As I told Alexander last time, my experiments with files I
 constructed myself showed no errors.

 Duncan Murdoch




Re: [R] Non-ASCII characters in R on Windows

2013-09-16 Thread Milan Bouchet-Valat
On Monday, 16 September 2013 at 13:39 -0400, Duncan Murdoch wrote:
 On 16/09/2013 12:04 PM, Maxim Linchits wrote:
  Here is that old post:
  http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
 
 In that post, you'll see I asked for a sample file.  I never received 
 any reply; presumably some spam filter didn't like what Alexander sent 
 me, and Nabble doesn't archive any attachment.
 
 Similarly, the Stackoverflow thread contains no sample data.
 
 Could someone who is having this problem please put a small sample 
 online for download?  As I told Alexander last time, my experiments with 
 files I constructed myself showed no errors.
Yes, this was my first reaction, and then I saw the link to a second
thread on StackOverflow with such an example. This is the one I used in
my previous posts in this thread. If you want to get the file directly
instead of pasting the contents by hand, here is a version that
should be enough:
http://nalimilan.perso.neuf.fr/transfert/utf8.csv


Regards

 Duncan Murdoch
 
 
  A taste: Again, the issue is that opening this UTF-8 encoded file
  under R 2.13.0 yields an error, but opening it under R 2.12.2 works
  without any issues. (...)
 
  On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat nalimi...@club.fr 
  wrote:
   Le lundi 16 septembre 2013 à 10:40 +0200, Milan Bouchet-Valat a écrit :
   Le vendredi 13 septembre 2013 à 23:38 +0400, Maxim Linchits a écrit :
This is a condensed version of the same question on stackexchange here:
http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
If you've already stumbled upon it feel free to ignore.
   
My problem is that R on US Windows does not read *any* text file that
contains *any* foreign characters. It simply reads the first 
consecutive n
ASCII characters and then throws a warning once it reached a foreign
character:
   
 test - read.table(test.txt, sep=;, dec=,, quote=,
fileEncoding=UTF-8)
Warning messages:
1: In read.table(test.txt, sep = ;, dec = ,, quote = , 
fileEncoding
= UTF-8) :
  invalid input found on input connection 'test.txt'
2: In read.table(test.txt, sep = ;, dec = ,, quote = , 
fileEncoding
= UTF-8) :
  incomplete final line found by readTableHeader on 'test.txt'
 print(test)
   V1
1 english
   
 Sys.getlocale()
   [1] LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;
 LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
   
   
It is important to note that that R on linux will read UTF-8 as well as
exotic character sets without a problem. I've tried it with the exact 
same
files (one was UTF-8 and another was OEM866 Cyrillic).
   
If I do not include the fileEncoding parameter, read.table will read 
the
whole CSV file. But naturally it will read it wrong because it does not
know the encoding. So whenever I try to specify the fileEncoding, R 
will
throw the warnings and stop once it reaches a foreign character. It's 
the
same story with all international character encodings.
Other users on stackexchange have reported exactly the same issue.
   
   
Is anyone here who is on a US version of Windows able to import files 
with
foreign characters? Please let me know.
   A reproducible example would have helped, as requested by the posting
   guide.
  
   Though I am also experiencing the same problem after saving the data
   below to a CSV file encoded in UTF-8 (you can do this using even the
   Notepad):
   Ա,Բ
   1,10
   2,20
  
   This is on a Windows 7 box using French locale, but same codepage 1252
   as yours. What is interesting is that reading the file using
   readLines(file(myFile.csv, encoding=UTF-8))
   gives no invalid characters. So there must be a bug in read.table().
  
  
   But I must note I do not experience issues with French accentuated
   characters like é (\Ue9). On the contrary, reading Armenian
   characters like Ա (\U531) gives weird results: the character appears
   as U+0531 instead of Ա.
  
   Self-contained example, writing the file and reading it back from R:
   tmpfile <- tempfile()
   writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
   readLines(file(tmpfile, encoding="UTF-8"))
   # "<U+0531>"
  
   The same phenomenon happens when creating a data frame from this
   character (as noted on StackExchange):
   data.frame("\U531")
  
   So my conclusion is that maybe Windows does not really support Unicode
   characters that are not relevant for your current locale. And that may
   have created bugs in the way R handles them in read.table(). R
   developers can probably tell us more about it.
   After some more investigation, one part of the problem can be traced
   back to scan() (with myFile.csv filled as described above):
   scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
 

Re: [R] Non-ACSII characters in R on Windows

2013-09-16 Thread Duncan Murdoch

On 16/09/2013 12:04 PM, Maxim Linchits wrote:

Here is that old post:
http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html


In that post, you'll see I asked for a sample file.  I never received 
any reply; presumably some spam filter didn't like what Alexander sent 
me, and Nabble doesn't archive attachments.


Similarly, the Stack Overflow thread contains no sample data.

Could someone who is having this problem please put a small sample 
online for download?  As I told Alexander last time, my experiments with 
files I constructed myself showed no errors.


Duncan Murdoch



A taste: Again, the issue is that opening this UTF-8 encoded file
under R 2.13.0 yields an error, but opening it under R 2.12.2 works
without any issues. (...)

On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat nalimi...@club.fr wrote:
 On Monday, 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
  On Friday, 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
  This is a condensed version of the same question on stackexchange here:
  
http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
  If you've already stumbled upon it feel free to ignore.
 
  My problem is that R on US Windows does not read *any* text file that
  contains *any* foreign characters. It simply reads the first consecutive n
  ASCII characters and then throws a warning once it reached a foreign
  character:
 
   > test <- read.table("test.txt", sep=";", dec=",", quote="",
  fileEncoding="UTF-8")
  Warning messages:
  1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
  = "UTF-8") :
    invalid input found on input connection 'test.txt'
  2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
  = "UTF-8") :
    incomplete final line found by readTableHeader on 'test.txt'
   > print(test)
          V1
  1 english
 
   > Sys.getlocale()
  [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
 
 
  It is important to note that R on Linux will read UTF-8 as well as
  exotic character sets without a problem. I've tried it with the exact
  same files (one was UTF-8 and another was OEM866 Cyrillic).
 
  If I do not include the fileEncoding parameter, read.table will read the
  whole CSV file. But naturally it will read it wrong because it does not
  know the encoding. So whenever I try to specify fileEncoding, R will
  throw the warnings and stop once it reaches a foreign character. It's the
  same story with all international character encodings.
  Other users on Stack Exchange have reported exactly the same issue.
 
 
  Is anyone here who is on a US version of Windows able to import files
  with foreign characters? Please let me know.
 A reproducible example would have helped, as requested by the posting
 guide.

 Though I am also experiencing the same problem after saving the data
 below to a CSV file encoded in UTF-8 (you can do this even using
 Notepad):
 Ա,Բ
 1,10
 2,20

 This is on a Windows 7 box using a French locale, but the same codepage
 1252 as yours. What is interesting is that reading the file using
 readLines(file("myFile.csv", encoding="UTF-8"))
 gives no invalid characters. So there must be a bug in read.table().


 But I must note I do not experience issues with French accented
 characters like é ("\Ue9"). On the contrary, reading Armenian
 characters like Ա ("\U531") gives weird results: the character appears
 as <U+0531> instead of Ա.

 Self-contained example, writing the file and reading it back from R:
 tmpfile <- tempfile()
 writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
 readLines(file(tmpfile, encoding="UTF-8"))
 # "<U+0531>"

 The same phenomenon happens when creating a data frame from this
 character (as noted on StackExchange):
 data.frame("\U531")

 So my conclusion is that maybe Windows does not really support Unicode
 characters that are not relevant for your current locale. And that may
 have created bugs in the way R handles them in read.table(). R
 developers can probably tell us more about it.
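[Milan's hypothesis — that Windows builds of R mangle characters outside the current codepage — can be probed with iconv(). This sketch is an editorial addition, not from the thread; the results are locale-dependent, as the comments note.]

```r
## Sketch: does a character survive conversion to the native encoding?
## iconv() returns NA for elements that cannot be represented.
survives <- function(ch) !is.na(iconv(ch, from = "UTF-8", to = ""))

survives("\u00e9")  # é: TRUE under cp1252 or any UTF-8 locale
survives("\u0531")  # Armenian Ayb: FALSE under cp1252, TRUE under UTF-8
```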
 After some more investigation, one part of the problem can be traced
 back to scan() (with myFile.csv filled as described above):
 scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
 # Read 2 items
 # [1] "Ա" "Բ"

 Equivalent, but nonsensical to me:
 scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
 # Read 2 items
 # [1] "Ա" "Բ"

 scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
 # Read 0 items
 # character(0)
 # Warning message:
 # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
 #  invalid input found on input connection 'myFile.csv'
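[The asymmetry above matches the documented difference between the two arguments (see ?scan and ?file): fileEncoding asks the connection to convert the bytes as they are read, while encoding only declares what encoding the resulting strings are in, marking them without conversion. A sketch of the distinction — an editorial addition, with what=character() included since scan() expects doubles by default:]

```r
## fileEncoding = re-encode while reading; encoding = read bytes as-is
## and merely mark the resulting strings. Demonstrated on a UTF-8 file.
tmp <- tempfile()
con <- file(tmp, "w", encoding = "UTF-8")
writeLines("\u0531,\u0532", con)
close(con)

x <- scan(tmp, what = character(), sep = ",", encoding = "UTF-8", nlines = 1)
Encoding(x)  # typically "UTF-8" "UTF-8": the mark, not a conversion
```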


 So there seems to be one part of the issue in scan(), which for some
 reason does not work when passed fileEncoding="UTF-8"; and another part
 in read.table(), which transforms Ա ("\U531") into "X.U.0531.",
 probably via make.names(), since: