Re: [R] Text Encoding
Dear Milan and David,

Thank you both very much for your help! I finally figured it out. Text on the website was UTF-8, but in the process of downloading it using RDF, it got converted to the Java/JavaScript escape encoding. To convert it back to UTF-8:

    test <- "4.5\\u00B5g of cDNA was used"
    iconv(test, "JAVA", "UTF-8")
    [1] "4.5µg of cDNA was used"

This may also affect anyone using JSON with R. Posting here in case it helps anyone else. =)

-Emily

On Sat, Apr 6, 2013 at 10:37 AM, David Winsemius dwinsem...@comcast.net wrote:

> On Apr 5, 2013, at 11:30 AM, Emily Ottensmeyer wrote:
>
>> Dear R-Help,
>>
>> I am using R 2.14 with the RDF package to download data from a website, and then use R to manipulate it. Text on the website is UTF-8. The RDF package's rdf_load command is converting it into a different encoding, which converts non-ASCII characters to unicode codes.
>>
>> On the webpage/sparql RDF:
>>     4.5µg of cDNA was used
>> In R, the RDF triple gives:
>>     4.5\\u00B5g of cDNA was used
>>
>> I can't seem to convert it back from \\u00B5 into µ.
>>
>> I've tried iconv with various settings without success:
>>     iconv(test, "latin1", "UTF-8")
>>     [1] "4.5\\u00B5g of cDNA was used"
>>
>> And, I tried Encoding, to see if I could figure that out, but it returns "unknown" on my string.
>>     Encoding(test)
>>     [1] "unknown"
>
> On my device entering this:
>
>     "4.5\\u00B5g of cDNA was used"
>
> ... returns
>
>     [1] "4.5\\u00B5g of cDNA was used"
>
> But entering:
>
>     "4.5\u00B5g of cDNA was used"
>
> returns:
>
>     [1] "4.5µg of cDNA was used"
>
>     nchar("4.5\\u00B5g of cDNA was used")
>     [1] 27
>     nchar("4.5\u00B5g of cDNA was used")
>     [1] 22
>
> So the doubled \ is really a single character in the first case and has no effect in escaping the next four hex digits, but \u00B5 in the second case is a correct micro-character (for my setup with my fonts).
>
> If this is a systematic problem then you should contact the maintainer with a full problem description and a link to the website. If this is just a one-off problem just remove the extraneous backslash.
>
> --
> David.
>
>     sessionInfo()
>     R version 3.0.0 RC (2013-03-31 r62463)
>     Platform: x86_64-apple-darwin10.8.0 (64-bit)
>     locale:
>     [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>     snipped
>
>> Anyone have any ideas on how to correct/convert the text encoding?
>>
>> Thanks!
>> -Emily
>
> David Winsemius
> Alameda, CA, USA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
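
The "JAVA" converter used above comes from the iconv library R was built against, and it may not be present in every installation (iconvlist() shows what a given build supports). As a fallback, here is a minimal base-R sketch that decodes \uXXXX escapes by hand. The function name unescape_unicode is hypothetical and not part of the RDF package or base R, and the sketch assumes only four-digit escapes in the Basic Multilingual Plane (no surrogate pairs).

```r
# Hypothetical helper (not from any package): replace each literal
# backslash-u-XXXX sequence in a character vector with the code point it names.
# Assumption: four hex digits only; surrogate pairs are not handled.
unescape_unicode <- function(x) {
  # Locate every literal "\uXXXX" sequence in each element of x
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  # Swap each match for its character: drop the leading "\u",
  # parse the remaining hex digits, and convert the code point to UTF-8
  regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
    vapply(codes,
           function(code) intToUtf8(strtoi(substring(code, 3), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

test <- "4.5\\u00B5g of cDNA was used"
fixed <- unescape_unicode(test)
nchar(test)   # 27: the escape is six separate ASCII characters
nchar(fixed)  # 22: the escape has become a single character
```

If adding a dependency is acceptable, stringi::stri_unescape_unicode() covers the same ground more thoroughly; the hand-rolled version is shown only because it needs nothing beyond base R.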
Re: [R] Text Encoding
On Friday, 5 April 2013 at 14:30 -0400, Emily Ottensmeyer wrote:

> Dear R-Help,
>
> I am using R 2.14 with the RDF package to download data from a website, and then use R to manipulate it. Text on the website is UTF-8. The RDF package's rdf_load command is converting it into a different encoding, which converts non-ASCII characters to unicode codes.
>
> On the webpage/sparql RDF:
>     4.5g of cDNA was used
> In R, the RDF triple gives:
>     4.5\\u00B5g of cDNA was used
>
> I can't seem to convert it back from \\u00B5 into .

Beware that \\u00B5 is the micro sign (greek letter mu), not . This is probably important information...

> I've tried iconv with various settings without success:
>     iconv(test, "latin1", "UTF-8")
>     [1] "4.5\\u00B5g of cDNA was used"

\\u00B5 looks like UTF-16, not UTF-8. Does this work?

    iconv(test, "UTF-16", "UTF-8")

> And, I tried Encoding, to see if I could figure that out, but it returns "unknown" on my string.
>     Encoding(test)
>     [1] "unknown"
>
> Anyone have any ideas on how to correct/convert the text encoding?

Can you provide us the file, or at least the required parts of it? You can also try loading the file using xmlParse() from the XML package.

Regards
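
One way to see why the charset conversions discussed above had no visible effect is to look at the bytes. The sketch below assumes test holds the string exactly as it is shown in the thread; under that assumption every byte is plain ASCII, so Encoding() has nothing to mark and a latin1-to-UTF-8 conversion is a no-op.

```r
# Assumption: this is the string as rdf_load returned it, per the thread
test <- "4.5\\u00B5g of cDNA was used"

# Every byte is below 128, i.e. the string is pure ASCII
bytes <- as.integer(charToRaw(test))
all_ascii <- all(bytes < 128L)

# ASCII strings carry no encoding mark, hence "unknown"
enc <- Encoding(test)

# Converting ASCII from latin1 to UTF-8 changes nothing
same_after_iconv <- identical(iconv(test, "latin1", "UTF-8"), test)

all_ascii         # TRUE
enc               # "unknown"
same_after_iconv  # TRUE
```

In other words, the text is not in a different byte encoding at all; the non-ASCII character has been replaced by an ASCII escape sequence, which has to be decoded rather than transcoded.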
Re: [R] Text Encoding
On Apr 5, 2013, at 11:30 AM, Emily Ottensmeyer wrote:

> Dear R-Help,
>
> I am using R 2.14 with the RDF package to download data from a website, and then use R to manipulate it. Text on the website is UTF-8. The RDF package's rdf_load command is converting it into a different encoding, which converts non-ASCII characters to unicode codes.
>
> On the webpage/sparql RDF:
>     4.5µg of cDNA was used
> In R, the RDF triple gives:
>     4.5\\u00B5g of cDNA was used
>
> I can't seem to convert it back from \\u00B5 into µ.
>
> I've tried iconv with various settings without success:
>     iconv(test, "latin1", "UTF-8")
>     [1] "4.5\\u00B5g of cDNA was used"
>
> And, I tried Encoding, to see if I could figure that out, but it returns "unknown" on my string.
>     Encoding(test)
>     [1] "unknown"

On my device entering this:

    "4.5\\u00B5g of cDNA was used"

... returns

    [1] "4.5\\u00B5g of cDNA was used"

But entering:

    "4.5\u00B5g of cDNA was used"

returns:

    [1] "4.5µg of cDNA was used"

    nchar("4.5\\u00B5g of cDNA was used")
    [1] 27
    nchar("4.5\u00B5g of cDNA was used")
    [1] 22

So the doubled \ is really a single character in the first case and has no effect in escaping the next four hex digits, but \u00B5 in the second case is a correct micro-character (for my setup with my fonts).

If this is a systematic problem then you should contact the maintainer with a full problem description and a link to the website. If this is just a one-off problem just remove the extraneous backslash.

--
David.

    sessionInfo()
    R version 3.0.0 RC (2013-03-31 r62463)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)
    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    snipped

> Anyone have any ideas on how to correct/convert the text encoding?
>
> Thanks!
> -Emily
David Winsemius
Alameda, CA, USA
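
The character counts in David's reply can be reproduced with a short sketch. The variable names escaped and literal are illustrative only. It also shows one caveat: the \u escape is resolved by the R parser when a string literal is read, so deleting the backslash from a string that already exists at run time leaves the text "u00B5" behind rather than producing the micro sign.

```r
# In source code "\\" is one backslash character, so this string contains
# the six ASCII characters \ u 0 0 B 5
escaped <- "4.5\\u00B5g of cDNA was used"

# Here the parser turns \u00B5 into the single character U+00B5
literal <- "4.5\u00B5g of cDNA was used"

n_escaped <- nchar(escaped)                  # 27
n_literal <- nchar(literal)                  # 22
micro_cp  <- utf8ToInt(substr(literal, 4, 4))  # 181, i.e. 0xB5

# Removing the backslash at run time does not decode the escape
stripped <- sub("\\", "", escaped, fixed = TRUE)
stripped  # "4.5u00B5g of cDNA was used"
```

So removing the extra backslash works when editing a literal typed into source code, while data that arrives already escaped needs an explicit decoding step.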