After some off-line interaction and testing with Tal, the latest
version of the XML package (3.9-4) should resolve these issues:
the encoding declared in the document is now used as the default in more cases.

It is often important to specify the encoding for HTML files explicitly in
the call to htmlParse(), and to write it as "UTF-8" rather than the lower-case "utf-8".
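
As a rough sketch of that advice (not the code I intend to add to the
package), the call might look like the following. The small in-memory
HTML string is a made-up sample standing in for a downloaded page; for
a real page one would pass the text returned by getURL() in the same way.

```r
library(XML)

# Hypothetical sample document containing Hebrew text, written with
# \u escapes so the source itself stays ASCII.
html <- paste0(
  "<html><head><title>\u05e9\u05dc\u05d5\u05dd</title></head>",
  "<body><p>\u05d7\u05d5\u05de\u05d5\u05e1</p></body></html>"
)
html <- enc2utf8(html)  # make sure the string is stored as UTF-8

# Give the encoding explicitly, in upper case, when parsing from text.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Pull the paragraph text back out to check that it survived parsing.
p <- xpathSApply(doc, "//p", xmlValue)
print(p)
```

Whether the text prints correctly on the console still depends on the
locale and the terminal, so this only addresses the parsing step.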

I'll add code to make this simpler when I get a chance.

  Thanks Tal

    D.

On 1/30/12 5:35 AM, Tal Galili wrote:
> Hello dear R-help mailing list.
> 
> 
> 
> I would like htmlParse to work well with Hebrew, but it keeps
> scrambling the Hebrew text in pages I feed into it.
> 
> For example:
> 
> # why can't I parse the Hebrew correctly?
> 
> library(RCurl)
> library(XML)
> u = "http://humus101.com/?p=2737";
> a = getURL(u)
> a # Here - the Hebrew is fine.
> a2 <- htmlParse(a)
> a2 # Here it is a mess...
> 
> None of these seem to fix it:
> 
> htmlParse(a, encoding = "utf-8")
> 
> htmlParse(a, encoding = "iso8859-8")
> 
> This is my locale:
> 
> > Sys.getlocale()
> [1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
> >
> 
> Any suggestions?
> 
> 
> Thanks up front,
> Tal
> 
> 
> 
> ----------------Contact Details:-------------------------------------------------------
> Contact me: tal.gal...@gmail.com |  972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
