Re: [R] Text Encoding

2013-04-09 Thread Emily Ottensmeyer
Dear Milan and David,

Thank you both very much for your help!  I finally figured it out.

Text on the website was UTF-8, but in the process of downloading it with
RDF, non-ASCII characters were converted to Java/JavaScript-style \uXXXX
escape sequences.  To convert them back to UTF-8:

> test <- "4.5\\u00B5g of cDNA was used"
> iconv(test, "JAVA", "UTF-8")
[1] "4.5µg of cDNA was used"
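
As far as I know, the "JAVA" converter name comes from GNU libiconv, so whether the call above works likely depends on the iconv implementation your R build uses. Where it is unavailable, something along these lines in base R might serve as a fallback. This is only a sketch: unescape_unicode is a hypothetical helper name (not part of R or the RDF package), and it assumes plain four-digit \uXXXX escapes with no surrogate pairs.

```r
# Hypothetical helper (not part of R or the RDF package): turn literal
# \uXXXX escape text back into the characters it names.
# Assumes four hex digits per escape; surrogate pairs are not handled.
unescape_unicode <- function(x) {
  # regex \\u[0-9A-Fa-f]{4}: a literal backslash, "u", then four hex digits
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    # drop the backslash and "u", read the rest as a hex code point
    vapply(esc, function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

test  <- "4.5\\u00B5g of cDNA was used"
fixed <- unescape_unicode(test)

print(nchar(test))   # 27: the escape is six separate characters
print(nchar(fixed))  # 22: the escape has become one character
print(fixed)
```

The regular-expression route avoids depending on a particular iconv build, at the cost of covering only the escape form seen in this thread.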

This may also impact anyone using JSON with R.  Posting here in case it
helps anyone else.  =)

-Emily


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Text Encoding

2013-04-06 Thread Milan Bouchet-Valat
On Friday, 5 April 2013 at 14:30 -0400, Emily Ottensmeyer wrote:
> Dear R-Help,
>
> I am using R 2.14 with the RDF package to download data from a website,
> and then using R to manipulate it.
>
> Text on the website is UTF-8.  The RDF package's rdf_load command is
> converting it into a different encoding, which converts non-ASCII
> characters to Unicode escape codes.
>
> On the webpage/SPARQL RDF: 4.5µg of cDNA was used
>
> In R, the RDF triple gives: 4.5\\u00B5g of cDNA was used
>
> I can't seem to convert it back from \\u00B5 into µ.
Beware that \\u00B5 is the micro sign (Greek letter mu), not the character
that appears in your message as I received it. This is probably important
information...

> I've tried iconv with various settings without success:
> > iconv(test, "latin1", "UTF-8")
> [1] "4.5\\u00B5g of cDNA was used"
\\u00B5 looks like UTF-16, not UTF-8. Does this work?
iconv(test, "UTF-16", "UTF-8")

> And, I tried Encoding, to see if I could figure that out, but it returns
> "unknown" on my string.
> > Encoding(test)
> [1] "unknown"
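
One way to see why Encoding() reports "unknown" is to look at the bytes R actually stores. A minimal sketch, assuming the string arrived from rdf_load exactly as printed above (the variable names here are illustrative): the backslash and the characters u00B5 are all plain ASCII, so there is no encoding to declare, and a latin1-to-UTF-8 conversion leaves the text unchanged.

```r
# Assumption: this is the string as rdf_load returned it, i.e. a literal
# backslash followed by "u00B5" (typed in R source as "\\u00B5").
test <- "4.5\\u00B5g of cDNA was used"

bytes <- charToRaw(test)
print(bytes[4:9])   # the six bytes that make up the escape text

# Every byte is below 0x80, so the string is pure ASCII.
ascii_only <- all(as.integer(bytes) < 128L)
print(ascii_only)

# ASCII strings carry no declared encoding in R.
print(Encoding(test))

# latin1 and UTF-8 agree on ASCII, so this conversion changes nothing.
print(iconv(test, "latin1", "UTF-8") == test)
```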
 
 
> Anyone have any ideas on how to correct/convert the text encoding?
Can you provide us with the file, or at least the relevant parts of it?

You can also try loading the file using xmlParse() from the XML package.


Regards



Re: [R] Text Encoding

2013-04-06 Thread David Winsemius

On Apr 5, 2013, at 11:30 AM, Emily Ottensmeyer wrote:

> Dear R-Help,
>
> I am using R 2.14 with the RDF package to download data from a website,
> and then using R to manipulate it.
>
> Text on the website is UTF-8.  The RDF package's rdf_load command is
> converting it into a different encoding, which converts non-ASCII
> characters to Unicode escape codes.
>
> On the webpage/SPARQL RDF: 4.5µg of cDNA was used
>
> In R, the RDF triple gives: 4.5\\u00B5g of cDNA was used
>
> I can't seem to convert it back from \\u00B5 into µ.
>
> I've tried iconv with various settings without success:
> > iconv(test, "latin1", "UTF-8")
> [1] "4.5\\u00B5g of cDNA was used"
>
> And, I tried Encoding, to see if I could figure that out, but it returns
> "unknown" on my string.
> > Encoding(test)
> [1] "unknown"
>
On my device entering this: "4.5\\u00B5g of cDNA was used"

... returns [1] "4.5\\u00B5g of cDNA was used"

But entering: "4.5\u00B5g of cDNA was used" returns:

[1] "4.5µg of cDNA was used"

> nchar("4.5\\u00B5g of cDNA was used")
[1] 27
> nchar("4.5\u00B5g of cDNA was used")
[1] 22

So the doubled \ is really a single character in the first case and has no
effect in escaping the next four hex digits, but \u00B5 in the second case is
a correct micro character (for my setup with my fonts).
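
The difference is easy to miss because print() shows strings in escaped form, so a single stored backslash is displayed as \\. A small sketch of the same comparison follows (the variable names are made up for illustration); writeLines() shows the raw contents instead.

```r
escaped <- "4.5\\u00B5g of cDNA was used"  # literal backslash + "u00B5"
literal <- "4.5\u00B5g of cDNA was used"   # parsed by R into the micro sign

print(escaped)       # escaped display: the backslash is shown doubled
writeLines(escaped)  # raw contents: a single backslash

n_escaped <- nchar(escaped)
n_literal <- nchar(literal)
print(c(n_escaped, n_literal))  # 27 22
```

The five extra characters in the first string are the backslash, the "u", and the four hex digits, less the one micro sign they stand for.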

If this is a systematic problem then you should contact the maintainer with a 
full problem description and a link to the website. If this is just a one-off 
problem just remove the extraneous backslash.

-- 
David.

> sessionInfo()
R version 3.0.0 RC (2013-03-31 r62463)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
snipped

> Anyone have any ideas on how to correct/convert the text encoding?
>
>
> Thanks!
> -Emily
 

David Winsemius
Alameda, CA, USA
