Re: [R] Text Encoding

2013-04-09 Thread Emily Ottensmeyer
Dear Milan and David,

Thank you both very much for your help!  I finally figured it out.

The text on the website was UTF-8, but while it was being downloaded with
the RDF package, non-ASCII characters were converted to Java/JavaScript-style
\uXXXX escape sequences.  To convert them back to UTF-8:

> test <- "4.5\\u00B5g of cDNA was used"
> iconv(test, "JAVA", "UTF-8")
[1] "4.5µg of cDNA was used"
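
A minimal sketch of a portable alternative, in case the "JAVA" converter is
not available in your iconv build (iconvlist() shows what is supported).  The
helper name unescape_unicode is hypothetical and not part of any package; it
only handles four-digit \uXXXX escapes and does not combine surrogate pairs.

```r
# Hypothetical helper (not from the RDF package): replace literal \uXXXX
# escapes in a character vector with the characters they stand for.
unescape_unicode <- function(x) {
  # Regex for a literal backslash, "u", and exactly four hex digits
  pattern <- "\\\\u[0-9A-Fa-f]{4}"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(escapes) {
    vapply(escapes, function(esc) {
      # Drop the leading "\u", parse the hex digits, emit the character
      intToUtf8(strtoi(substring(esc, 3), base = 16L))
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

test <- "4.5\\u00B5g of cDNA was used"
fixed <- unescape_unicode(test)
nchar(test)   # 27
nchar(fixed)  # 22
```

Strings without any escapes pass through unchanged, so the helper can be
applied to a whole column of literals.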

This may also impact anyone using JSON with R.  Posting here in case it
helps anyone else.  =)

-Emily


On Sat, Apr 6, 2013 at 10:37 AM, David Winsemius wrote:

>
> On Apr 5, 2013, at 11:30 AM, Emily Ottensmeyer wrote:
>
> > Dear R-Help,
> >
> > I am using R 2.14 with the RDF package to download data
> > from a website, and then use R to manipulate it.
> >
> > Text on the website is UTF-8.  The RDF package's rdf_load command is
> > converting it into a different encoding, which converts non-ASCII
> > characters to unicode codes.
> >
> > On the webpage/sparql RDF: "4.5µg of cDNA was used"
> >
> > In R, the RDF triple gives: "4.5\\u00B5g of cDNA was used"
> >
> > I can't seem to convert it back from \\u00B5  into "µ".
> >
> > I've tried iconv with various settings without success:
> >> iconv(test, "latin1", "UTF-8")
> > [1] "4.5\\u00B5g of cDNA was used"
> >
> > And, I tried Encoding, to see if I could figure that out, but it returns
> > "unknown" on my string.
> >> Encoding(test)
> > [1] "unknown"
> >
> On my device entering this: "4.5\\u00B5g of cDNA was used"
>
> ... returns [1] "4.5\\u00B5g of cDNA was used"
>
> But entering: "4.5\u00B5g of cDNA was used" returns:
>
> [1] "4.5µg of cDNA was used"
>
> > nchar("4.5\\u00B5g of cDNA was used")
> [1] 27
> > nchar("4.5\u00B5g of cDNA was used")
> [1] 22
>
> So the doubled "\" is really a single literal backslash in the first case
> and does not escape the next four hex digits, whereas "\u00B5" in the
> second case is a proper micro character (for my setup with my fonts).
>
> If this is a systematic problem, then you should contact the maintainer
> with a full problem description and a link to the website. If this is just
> a one-off problem, simply remove the extraneous backslash.
>
> --
> David.
>
> > sessionInfo()
> R version 3.0.0 RC (2013-03-31 r62463)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> 
>
> > Anyone have any ideas on how to correct/convert the text encoding?
> >
> >
> > Thanks!
> > -Emily
> >
> >   [[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>
>



Re: [R] Text Encoding

2013-04-06 Thread David Winsemius

On Apr 5, 2013, at 11:30 AM, Emily Ottensmeyer wrote:

> Dear R-Help,
> 
> I am using R 2.14 with the RDF package to download data
> from a website, and then use R to manipulate it.
> 
> Text on the website is UTF-8.  The RDF package's rdf_load command is
> converting it into a different encoding, which converts non-ASCII
> characters to unicode codes.
> 
> On the webpage/sparql RDF: "4.5µg of cDNA was used"
> 
> In R, the RDF triple gives: "4.5\\u00B5g of cDNA was used"
> 
> I can't seem to convert it back from \\u00B5  into "µ".
> 
> I've tried iconv with various settings without success:
>> iconv(test, "latin1", "UTF-8")
> [1] "4.5\\u00B5g of cDNA was used"
> 
> And, I tried Encoding, to see if I could figure that out, but it returns
> "unknown" on my string.
>> Encoding(test)
> [1] "unknown"
> 
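Encoding() reports "unknown" for any string that contains only ASCII
characters, and the escaped form is pure ASCII (a backslash, "u" and four hex
digits), so that result is expected.  A small sketch; the variable names are
only for illustration:

```r
# Literal backslash followed by "u00B5": six ASCII characters
escaped <- "4.5\\u00B5g of cDNA was used"
# A real micro sign, explicitly converted to UTF-8
decoded <- enc2utf8("4.5\u00B5g of cDNA was used")

Encoding(escaped)               # "unknown": ASCII-only strings are never marked
all(utf8ToInt(escaped) < 128L)  # TRUE: every character is ASCII
Encoding(decoded)               # "UTF-8"
```
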
On my device entering this: "4.5\\u00B5g of cDNA was used"

... returns [1] "4.5\\u00B5g of cDNA was used"

But entering: "4.5\u00B5g of cDNA was used" returns:

[1] "4.5µg of cDNA was used"

> nchar("4.5\\u00B5g of cDNA was used")
[1] 27
> nchar("4.5\u00B5g of cDNA was used")
[1] 22

So the doubled "\" is really a single literal backslash in the first case
and does not escape the next four hex digits, whereas "\u00B5" in the second
case is a proper micro character (for my setup with my fonts).

If this is a systematic problem, then you should contact the maintainer with
a full problem description and a link to the website. If this is just a
one-off problem, simply remove the extraneous backslash.
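
The character counts above can be checked directly.  This is only a sketch
with illustrative variable names; it shows that the escaped form holds six
separate ASCII characters where the other string holds one code point.

```r
escaped <- "4.5\\u00B5g of cDNA was used"
literal <- "4.5\u00B5g of cDNA was used"

# print() re-escapes the backslash; writeLines() shows the actual contents
print(escaped)
writeLines(escaped)

# Characters 4 to 9 of the escaped string are six separate ASCII characters
strsplit(substring(escaped, 4, 9), "")[[1]]

# Character 4 of the other string is the single code point U+00B5
utf8ToInt(enc2utf8(substring(literal, 4, 4)))   # 181, i.e. 0xB5
```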

-- 
David.

> sessionInfo()
R version 3.0.0 RC (2013-03-31 r62463)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8


> Anyone have any ideas on how to correct/convert the text encoding?
> 
> 
> Thanks!
> -Emily
> 

David Winsemius
Alameda, CA, USA



Re: [R] Text Encoding

2013-04-06 Thread Milan Bouchet-Valat
On Friday, 05 April 2013 at 14:30 -0400, Emily Ottensmeyer wrote:
> Dear R-Help,
> 
> I am using R 2.14 with the RDF package to download data
> from a website, and then use R to manipulate it.
> 
> Text on the website is UTF-8.  The RDF package's rdf_load command is
> converting it into a different encoding, which converts non-ASCII
> characters to unicode codes.
> 
> On the webpage/sparql RDF: "4.5g of cDNA was used"
> 
> In R, the RDF triple gives: "4.5\\u00B5g of cDNA was used"
> 
> I can't seem to convert it back from \\u00B5  into "".
Beware that \\u00B5 is the micro sign (which looks like the Greek letter
mu), not "". This is probably important information...

> I've tried iconv with various settings without success:
> > iconv(test, "latin1", "UTF-8")
> [1] "4.5\\u00B5g of cDNA was used"
\\u00B5 looks like UTF-16, not UTF-8. Does this work?
iconv(test, "UTF-16", "UTF-8")

> And, I tried Encoding, to see if I could figure that out, but it returns
> "unknown" on my string.
> > Encoding(test)
> [1] "unknown"
> 
> 
> Anyone have any ideas on how to correct/convert the text encoding?
Can you provide us the file, or at least the required parts of it?

You can also try loading the file using xmlParse() from the XML package.


Regards
