I think I will have to install Linux in a VM... Is there a shorter way, by the way? Could you please advise how to rebuild the 'XML' package for R against the latest libxml sources? Who could do that? Or is it possible to build a new R package on top of a parser that is not written in C, for example one based on PyPy, Erlang and so on?
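On the UTF-8 file-writing problem quoted below, here is a minimal base-R sketch of a workaround that might be worth trying. It is an assumption on my part, not something confirmed in this thread. It converts CP1251 bytes to UTF-8 with iconv() and writes them with useBytes = TRUE, so that R should not re-encode the string to the native locale on output. The test string, the temporary file and the "CP1251" name passed to iconv() are illustrative choices.

```r
# Cyrillic test string written with \u escapes, so this script is pure ASCII
x <- "\u0412\u0441\u0435 \u043f\u0440\u0430\u0432\u0430"

# Simulate what a windows-1251 page delivers: the raw bytes in that encoding
raw1251 <- iconv(x, from = "UTF-8", to = "CP1251", toRaw = TRUE)

# Convert those bytes to UTF-8 explicitly; iconv() accepts the list of raw
# vectors it produced above and marks the result as UTF-8
s <- iconv(raw1251, from = "CP1251", to = "UTF-8")

# Write the UTF-8 bytes as they are, without translation to the native encoding
f <- tempfile(fileext = ".txt")
con <- file(f, open = "wb")
writeLines(s, con, useBytes = TRUE)
close(con)

# Read the file back and declare the strings to be UTF-8
x2 <- readLines(f, encoding = "UTF-8")
print(Encoding(x2))
print(identical(x2, x))
```

If the round trip works on the Windows machine, the same writeLines(..., useBytes = TRUE) call could be applied to the string returned by getURL() before parsing the saved file.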
2013/2/22 Milan Bouchet-Valat <nalimi...@club.fr>

> On Thursday 21 February 2013 at 18:53 +0400, Lawr Eskin wrote:
> > I tried iconv before in various attempts: same issue, and the result
> > has encoding = unknown.
> > Now I have tried sub: same issue.
> This procedure works on Linux, but not on Windows:
>
> library(RCurl)
> library(XML)
> u <- "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> a <- getURL(u, .encoding="UTF-8")
> a <- iconv(a, "windows-1251", "UTF-8")
> a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
> a2
>
> But maybe the problem is more general, and related to conversion between
> encodings on Windows. What looks weird to me is that on Windows, I'm not
> able to save a character string to a file in UTF-8, despite what ?file
> says:
> x <- "Все права защищены"
> Encoding(x)
> # UTF-8
> cat(x, file = (con <- file("foo", "w", encoding="UTF-8"))); close(con)
> x2 <- readLines(con <- file("foo", "r", encoding="UTF-8")); close(con)
> Encoding(x2)
> # unknown
> x2
> # [1] "<U+041A><U+0443>..."
>
> I know the problem happens on write because the file cannot be read
> correctly on Linux either.
>
> This Windows machine uses Windows Server 2008 with French_France.1252
> locale.
>
> > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > On Thursday 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> > > Hi Milan,
> > >
> > > a <- getURL(con, .encoding = "UTF-8")
> > > Encoding(a)
> > > [1] "UTF-8"
> > > a # Here the UTF-8 text looks fine.
> > > htmlParse(a, encoding = "UTF-8") ### again the same encoding issue
> >
> > And what if you try this:
> > a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
> >
> > or this:
> > a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
> >
> > Cheers
> >
> > > >> why didn't getURL() detect and set a's encoding correctly?
> > > I think it is an issue with the page, because other sites work fine.
> > >
> > > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > > On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > > > Hi Milan!
> > > >
> > > > > Encoding(a)
> > > > [1] "unknown"
> > >
> > > Hm, here I get "UTF-8", which is my locale encoding.
> > >
> > > I've tried a little more, and I discovered that using
> > > a <- getURL(u, .encoding="UTF-8")
> > > ensures that a is in the correct encoding here. I know this is not your
> > > problem, but it might help: check whether Encoding(a) is set to "UTF-8"
> > > or not in that case, and whether this fixes things.
> > >
> > > I'm not sure how htmlParse() detects the encoding when you pass it a
> > > character vector, but it probably uses Encoding(a), since that's the
> > > only reliable information; if it is missing, maybe it falls back to what
> > > the contents of the file say (maybe even before what the "encoding"
> > > argument says), which is windows-1251, and may not be the encoding in
> > > which getURL() saved the character vector. The question would then be:
> > > why didn't getURL() detect and set a's encoding correctly?
> > >
> > > My two cents
> > >
> > > > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > > > On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > > > > Hello dear R-help mailing list.
> > > > >
> > > > > Looks like the same issue in Russian:
> > > > >
> > > > > library(RCurl)
> > > > > library(XML)
> > > > > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > > > > a = getURL(u)
> > > > > a # Here - the Russian is fine.
> > > > > a2 <- htmlParse(a)
> > > > > a2 # Here it is a mess...
> > > > >
> > > > > None of these seem to fix it:
> > > > >
> > > > > htmlParse(a, encoding = "windows-1251")
> > > > > htmlParse(a, encoding = "CP1251")
> > > > > htmlParse(a, encoding = "cp1251")
> > > > > htmlParse(a, encoding = "iso8859-5")
> > > > >
> > > > > This is my locale:
> > > > >
> > > > > Sys.getlocale()
> > > > >
> > > > > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> > > > >
> > > > > Any suggestions?
> > > >
> > > > What does Encoding(a) say?
> > > >
> > > > (FWIW, here on Linux even a is not in the correct encoding:
> > > > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> > > > "http://www.w3.org/TR/REC-html40/loose.dtd">
> > > > <html><head>
> > > > <title>ÐÐÐÑÐÐÐÐÐÐÐÑ ЮФÐÂЮÐÐЮЬÐÂÐ ÐÐР ÐÑÐÑ ÐÐÐÑÐ аÐÐÐÐа
> > > > ÐÑ ÐÑ ÐÐЮР±ÐÐÐÑÐÒ Ðâ 11430 ЮÐÐÐÑÐÑÐÑЫÐÒÐÂÐÐЩ Ю ÐÐаЮФÐ
> > > > ЦÐÒ Ð®Ð¤Ð ЮÐÐЮЬРРÐÐÐÂле ÐÐÐÑÐ аÐÐÐÐа</title>
> > > > [...])
> > > >
> > > > Regards
> > > >
> > > > > Thank you very much in advance,
> > > > >
> > > > > Lavrentiy Eskin
> > > > > <http://www.eng.nvg.ru>
> > > > >
> > > > > [[alternative HTML version deleted]]
> > > > >
> > > > > ______________________________________________
> > > > > R-help@r-project.org mailing list
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide
> > > > > http://www.R-project.org/posting-guide.html
> > > > > and provide commented, minimal, self-contained, reproducible code.