[R] Chinese characters in HTML source captured by download.file() are garbled; how to convert them to readable text
Dear list,

I am using R to download the HTML source of many pages, from which data will be extracted and processed further. The problem is that the Chinese characters in the HTML source are all garbled, and I cannot find a way to convert them into something readable. The problem persists on Ubuntu 10 and Windows 7, both in an English environment. I have not yet tried an operating system set to Chinese. I know next to nothing about encodings, and an extensive online search has not helped.

# the code
download.file("https://www.google.com.hk/finance/company_news?q=SHA:601857&gl=cn&num=200", destfile="tmp.txt")
test <- readLines("tmp.txt", encoding="UTF-8")

# the garbled text in tmp.txt and in test looks like this:
#��#22269;�۪o�ѵM�a�ѥ��������q�]�

Any help is highly appreciated.

yong

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
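One possible approach, sketched under assumptions: garbling of this kind usually means the bytes on disk are in one encoding while R is told they are in another. The sketch below assumes the page is served in Big5, which the post does not confirm; the actual charset should be checked in the page's HTTP header or meta tag. To stay self-contained it writes two Big5-encoded characters to a temporary file instead of downloading anything, then converts the bytes to UTF-8 with iconv().

```r
# Hypothetical stand-in for the downloaded page: the two characters U+4E2D U+6587
# written as raw Big5 bytes (0xA4A4, 0xA4E5) followed by a newline.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xA4, 0xA4, 0xA4, 0xE5, 0x0A)), tmp)

# Read the lines without declaring an encoding, then convert explicitly.
# "BIG5" is an assumption about the source; replace it with the real charset.
raw_lines  <- readLines(tmp, warn = FALSE)
utf8_lines <- iconv(raw_lines, from = "BIG5", to = "UTF-8")

# iconv() returns NA for input that is not valid in the assumed encoding,
# which is a quick way to notice that the guess was wrong.
conversion_ok <- !anyNA(utf8_lines)
```

Passing encoding="UTF-8" to readLines() only marks the strings as UTF-8; it does not convert them, so it cannot repair text that was stored in a different encoding.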
[R] How to speed up a for loop by releasing memory
Dear list,

How can I speed up the following (illustrative) code?

#
con <- vector("numeric")
for (i in 1:limit) {
  if (matched data for the ith item found) {
    if (i == 1) {
      con <- RowOfMatchedData
    } else {
      con <- rbind(con, RowOfMatchedData)
    }
  }
}
#

Each RowOfMatchedData contains 105 variables. When i runs past 10^7 and the data container con gets large enough, the code becomes extremely slow. I believe this is a working-memory problem (2 GB only). Is there any way to circumvent it without slicing and dicing the data?
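A minimal sketch of one common alternative, assuming the slowdown comes from rbind() copying the whole container on every iteration rather than from memory alone: collect the matched rows in a preallocated list and bind them once at the end. The loop size, the matching condition and the row contents below are hypothetical placeholders.

```r
n_items <- 1000                      # stand-in for `limit` in the post
n_vars  <- 105                       # variables per matched row

rows <- vector("list", n_items)      # preallocated; unmatched slots stay NULL
for (i in seq_len(n_items)) {
  if (i %% 2 == 0) {                 # hypothetical "match found" condition
    rows[[i]] <- rep(i, n_vars)      # hypothetical matched row
  }
}

# Drop the unmatched (NULL) slots and bind everything in a single call.
rows <- rows[!vapply(rows, is.null, logical(1))]
con  <- do.call(rbind, rows)
```

With this pattern each row is stored once and the container is copied once, instead of once per iteration. It does not reduce the final memory footprint, so a result that truly exceeds 2 GB would still need to be processed in pieces.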
[R] mlogit package, Error in X[omitlines, ] <- NA : subscript out of bounds
I am using the mlogit package and have run into a data problem for which I cannot find any clue in the R archives. The code below shows the relevant steps all the way to the error.

#---
mydata <- data.frame(dependent, x, y, z)
mydata$dependent <- as.factor(mydata$dependent)
mldata <- mlogit.data(mydata, varying=NULL, choice="dependent", shape="wide")
summary(mlogit.1 <- mlogit(dependent ~ 1 | x + y + z, data = mldata, reflevel="0"))
Error in X[omitlines, ] <- NA : subscript out of bounds
#---

Could anybody kindly suggest how I might solve this problem?

Thank you
yong
[R] any package for the Heckman selection model when the outcome equation is also probit?
Hi all,

Can anybody tell me whether there is an existing package or function for the Heckman selection model where the outcome model is also probit? In Stata it is called heckprob.

Thank you
yong
[R] read.table() with \t as separator: all other programs report equal fields in each row, read.table() returns an unequal row length error
Hi list,

R is undoubtedly my favorite statistical tool; however, the data input part has long been a pain. Most of the data I have to deal with are irregular and contain special characters. Recently I got a tab-delimited file. read.table(filename, sep="\t") constantly returns errors saying that certain rows do not have xyz elements, while other programs such as perl, python and awk all report equal row lengths when "\t" is used as the separator. I scouted through the problematic rows. Sometimes the cause is that a row contains a #, so I go back and specify comment.char=""; next it is some other problem, and for some rows I simply cannot figure out what the problem is.

Can any guru suggest how to avoid this pain now and in the future? Is CSV a safer format? Or can anyone tell me which fundamental principles I must bear in mind when doing preliminary data processing in other programs such as perl, to ensure that the output can be readily fed into R?

best
yong
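A hedged sketch of how such files are often diagnosed: besides the comment character, read.table() also treats ' and " as quote characters by default, so an unmatched apostrophe can swallow tabs and change the field count. The example below builds a small hypothetical file containing both a # and an apostrophe, counts fields per line with count.fields(), and reads it with quoting and comments switched off.

```r
# Hypothetical tab-delimited file with characters that trip the defaults.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tnote",
             "1\tO'Brien\tfirst",
             "2\titem #7\tsecond",
             "3\tplain\tthird"), tmp)

# Fields per line, with quote and comment handling disabled so that the
# count matches what perl/awk see when splitting on tabs only.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Read with the same settings.
dat <- read.table(tmp, sep = "\t", header = TRUE, quote = "",
                  comment.char = "", stringsAsFactors = FALSE)
```

Running count.fields() with the default settings on the real file and comparing against the call above shows which lines are affected by quoting or comments.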
Re: [R] file reading problem unique to windows. Err info: Error in file(file, ifelse(append, "a", "w")). cannot open the connection
Thanks a lot, Prof. Ripley.

The problem must be download.file() in versions of R prior to R-patched 2.12. In each iteration of the loop I tried a couple of candidate links, of which only one or none would work; the unclosed destination files must soon have accumulated beyond what Windows tolerates. I updated R from 2.10 to R-patched 2.12 and the problem is gone.

On Thu, Nov 25, 2010 at 3:09 AM, Prof Brian Ripley rip...@stats.ox.ac.uk wrote:

We don't have any of the information asked for in the posting guide, such as your version of R, reproducible example. But please try R-patched, since this might be

• download.file() could leave the destination file open if the URL was not able to be opened. (PR#14414)

(If you had followed the posting guide you would have tried R-patched before posting )

On Wed, 24 Nov 2010, Yong Wang wrote:

Dear List

I asked this question before, got some tips but can't get it solved.

Where? You didn't give a reference, and it would have helped the helpers.

It is clear now that this problem only occurs when run on Windows (I tested it on Windows XP); nothing goes wrong at all when run on Unix. Unfortunately, sometimes I have to run it on Windows. I appreciate any suggestion on how to circumvent this problem on Windows. Below is the problem description I submitted before.

#
I am running a loop that downloads web pages and saves the HTML to a temporary file (using download.file()), reads it back in (using readLines) for processing, and finally writes the useful information from each processed page to a unique file. The problem is that once the loop reaches somewhere near 5000 iterations, it throws an error like the one below and goes no further.

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

In the meantime, a request for any new connection fails. For example, a request for the help page of file triggers the error below:

---
?file
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") : cannot open compressed file 'C:/PROGRA~1/R/R-211~1.1/library/stats/help/aliases.rds', probable reason 'Too many open files'
---

I am not sure whether the problem is too many unclosed connections, since I close the file connection after each readLines. Checking with showConnections(all=T) does not show excessive connections, and closeAllConnections() does not help.

Can anyone help me with this? Any answer is highly appreciated.

yong

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
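Independently of the download.file() fix mentioned in the thread, a loop like this can be written so that no connection is left behind when a link fails. The sketch below is one way to do it, not the poster's code: read_page() is a hypothetical helper that reads straight from a connection, always closes it, and returns NULL on failure. To stay self-contained it uses file:// URLs built from temporary files (assuming Unix-style paths) instead of real web pages.

```r
# Hypothetical helper: read a page, guarantee the connection is closed,
# and return NULL instead of stopping when the URL cannot be opened.
read_page <- function(u) {
  con <- url(u)
  on.exit(close(con), add = TRUE)   # runs whether readLines succeeds or fails
  tryCatch(readLines(con, warn = FALSE), error = function(e) NULL)
}

# Stand-ins for candidate links: one that exists and one that does not.
page_file <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>ok</body>", "</html>"), page_file)

good <- read_page(paste0("file://", page_file))
bad  <- suppressWarnings(read_page(paste0("file://", tempfile())))
```

Because the failing link returns NULL rather than raising an error, the surrounding loop can simply move on to the next candidate.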
[R] file reading problem unique to windows. Err info: Error in file(file, ifelse(append, "a", "w")). cannot open the connection
Dear List

I asked this question before, got some tips but can't get it solved. It is clear now that this problem only occurs when run on Windows (I tested it on Windows XP); nothing goes wrong at all when run on Unix. Unfortunately, sometimes I have to run it on Windows. I appreciate any suggestion on how to circumvent this problem on Windows. Below is the problem description I submitted before.

#
I am running a loop that downloads web pages and saves the HTML to a temporary file (using download.file()), reads it back in (using readLines) for processing, and finally writes the useful information from each processed page to a unique file. The problem is that once the loop reaches somewhere near 5000 iterations, it throws an error like the one below and goes no further.

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

In the meantime, a request for any new connection fails. For example, a request for the help page of file triggers the error below:

---
?file
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") : cannot open compressed file 'C:/PROGRA~1/R/R-211~1.1/library/stats/help/aliases.rds', probable reason 'Too many open files'
---

I am not sure whether the problem is too many unclosed connections, since I close the file connection after each readLines. Checking with showConnections(all=T) does not show excessive connections, and closeAllConnections() does not help.

Can anyone help me with this? Any answer is highly appreciated.

yong
[R] what does this err mean and how to solve it? Error in file(file, ifelse(append, "a", "w"))
Dear List

I am running a loop that downloads web pages and saves the HTML to a temporary file (using download.file()), reads it back in (using readLines) for processing, and finally writes the useful information from each processed page to a unique file. The problem is that once the loop reaches somewhere near 5000 iterations, it throws an error like the one below and goes no further.

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

In the meantime, a request for any new connection fails. For example, a request for the help page of file triggers the error below:

---
?file
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") : cannot open compressed file 'C:/PROGRA~1/R/R-211~1.1/library/stats/help/aliases.rds', probable reason 'Too many open files'
---

I am not sure whether the problem is too many unclosed connections, since I close the file connection after each readLines. Checking with showConnections(all=T) does not show excessive connections, and closeAllConnections() does not help.

Can anyone help me with this? Any answer is highly appreciated.

yong
[R] How to store a regular expression in a variable
Dear list

I know how to store a regular expression in perl and ruby, but have no clue how to do it in R. I have read the R regex manual and the archives and searched online, but I still need somebody to help me out on how to store a regular expression in a variable.

Thank you very much
yong
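For what it is worth, R has no separate regex literal: a pattern is an ordinary character string, so it can be stored in a variable and passed to grepl(), sub(), regmatches() and friends. A minimal sketch, with a hypothetical date pattern and sample strings chosen only for illustration:

```r
# The pattern is just a string; backslashes would need doubling ("\\d").
date_pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"

x <- c("2010-11-25", "25 Nov 2010", "2011-01-03")

# Reuse the stored pattern in different functions.
is_date <- grepl(date_pattern, x)
years   <- sub("^([0-9]{4})-.*$", "\\1", x[is_date])
```

Patterns can also be assembled from pieces with paste0() before use, since they are plain strings.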
[R] Read in an all-character file and specify field separator and record separator
Dear list

I used to use python or awk for preliminary processing and then feed the result into R. In some circumstances, passing the data along becomes quite a pain. I am wondering whether there is a convenient way to read a text file into R (not data, a text file in the common sense) while specifying the field separator and the record separator, so that the whole job becomes one-stop shopping. Or, more simply, is there a way to read in the text file with each row in a single column? scan(sep="\n") does not work as expected.

Thanks
yong
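A small sketch of two base-R routes, assuming newline-terminated records and a hypothetical "|" field separator: readLines() returns one string per line, and scan() does the same once it is told to expect character data (its default is numeric, which may be why the call in the post misbehaved). Fields can then be split with strsplit().

```r
# Hypothetical text file with ragged records.
tmp <- tempfile()
writeLines(c("alpha|beta|gamma", "one|two", "x|y|z"), tmp)

# One element per line.
records <- readLines(tmp)

# The scan() equivalent: declare character input and switch off quoting.
scanned <- scan(tmp, what = character(), sep = "\n", quote = "", quiet = TRUE)

# Split each record into fields; the result is a list, so rows may differ in length.
fields <- strsplit(records, "|", fixed = TRUE)

# Each row in a single column of a data frame.
as_column <- data.frame(text = records, stringsAsFactors = FALSE)
```

Note that scan() skips blank lines by default, whereas readLines() keeps them as empty strings.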
[R] How to execute multiple R scripts sequentially in unix background
Dear list

I need to 1) run several R scripts sequentially, because each one has to wait for the results of the previous one, and 2) run them in the Unix background, since my ssh session frequently times out for some reason. If I paste the following commands into the Unix shell

R --vanilla script1
R --vanilla script2
R --vanilla script3

the three scripts are executed simultaneously instead of sequentially. source() might be an alternative; however, I am not clear how to run it in the background.

Thanks
yong
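One way to combine the two requirements, sketched under assumptions: put the source() calls in a single master script, where they necessarily run one after another, and start that one script in the background, for example with `nohup R CMD BATCH --vanilla master.R &` (master.R is a hypothetical file name). The example below creates three tiny placeholder scripts in a temporary directory so that the ordering can be seen; the real scripts would take their place.

```r
# Placeholder scripts: each appends its own name to a shared log file.
script_dir <- tempfile("scripts")
dir.create(script_dir)
log_file <- file.path(script_dir, "order.log")
for (k in 1:3) {
  writeLines(sprintf('cat("script%d\\n", file = "%s", append = TRUE)', k, log_file),
             file.path(script_dir, sprintf("script%d.R", k)))
}

# Body of the master script: source() returns only after a script has
# finished, so the next one cannot start early.
scripts <- file.path(script_dir, sprintf("script%d.R", 1:3))
for (s in scripts) source(s)

run_order <- readLines(log_file)
```

An error in one script stops the master script, which is usually what is wanted when later scripts depend on earlier results.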
[R] Why does eval(parse(text="var(vec)")) return a matrix and NOT a number?
Dear List

I have a problem when using eval(parse()). The code below sketches what I am trying to do: for each row of an N*K data frame (a 2*2 data frame in the example below), apply a number of functions and get the outputs (two functions, sum and var, are used in the example). The problem is that eval(parse(text="sum(vec)")) works fine, but not when sum is replaced by var; in the latter case a matrix is returned instead of a number. Any suggestion is highly appreciated.

Thank you

#=== The function
myloop <- function(datfra, funs) {
  rows <- dim(datfra)[1]
  totfunnum <- length(funs)
  for (i in 1:rows) {
    vec <- datfra[i, ]
    for (k in 1:totfunnum) {
      print(funs[k])
      x <- eval(parse(text = funs[k]))
      print(x)
    }
  }
}

# Experimental run
workport <- data.frame(matrix(1:4, 2, 2))
funs <- c("sum(vec,na.rm=T)", "var(vec,na.rm=T)")
myloop(workport, funs)

# Outputs of the experimental run
[1] "sum(vec,na.rm=T)"
[1] 4
[1] "var(vec,na.rm=T)"
   X1 X2
X1 NA NA
X2 NA NA
[1] "sum(vec,na.rm=T)"
[1] 6
[1] "var(vec,na.rm=T)"
   X1 X2
X1 NA NA
X2 NA NA
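A likely explanation, offered as a sketch rather than a definitive diagnosis: datfra[i, ] is a one-row data frame, not a vector. sum() adds up all its cells, but var() applied to a data frame computes the covariance matrix between columns, and with a single observation every entry is NA. Converting the row to a plain vector first gives a single number. The second part of the example shows an alternative that passes the functions themselves instead of strings to be parsed.

```r
workport <- data.frame(matrix(1:4, 2, 2))

# Fix within the original eval(parse()) approach: unlist the one-row data frame.
vec <- workport[1, ]
v1  <- eval(parse(text = "var(unlist(vec), na.rm = TRUE)"))

# Alternative: keep the functions in a named list and apply them row by row.
funs <- list(sum = sum, var = var)
row_stats <- t(apply(as.matrix(workport), 1, function(vec) {
  sapply(funs, function(f) f(vec, na.rm = TRUE))
}))
```

Here row_stats has one row per row of the data frame and one column per function, so the results can be indexed by name.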
[R] For the interaction of a continuous var and a categorical var, any way to treat the categorical var as continuous?
Dear list,

This is NOT a technical question regarding the use of R.

I have a linear model in which the response variable is neighborhood safety. It is hypothesized that poverty reduces safety and that the number of officers per thousand residents improves it. The focal hypothesis is that poverty poses less of a threat to safety when the number of officers is high. To check the focal hypothesis, the continuous variable officers was recoded as categorical with two levels (high and low). The results are below and support the hypothesis.

#=
model <- lm(neighborhood_safety ~ poverty * officers)

The coefficients (all significant):
poverty             -0.05
officers             0.058
poverty:officers     0.014
#==

My question is how to check whether the poverty effect weakens with a minuscule increase in officers. The coefficient of the interaction between continuous poverty and continuous officers is hard to interpret and seems unsuitable for checking the focal hypothesis since, say, (poverty=3, officers=8) gives the same interaction term as (poverty=8, officers=3).

Thanks a lot in advance for any suggestions!

Sincerely,
Will
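One standard way to read a continuous-by-continuous interaction is through the conditional slope: in a model with terms poverty, officers and poverty:officers, the effect of one unit of poverty at a given officer level is b_poverty + b_interaction * officers. The sketch below plugs in the coefficients quoted in the post purely for illustration (they were estimated with officers recoded as two levels, so the numbers would differ in a refit with continuous officers), and the officer values are hypothetical.

```r
# Coefficients as quoted in the post (illustrative only).
b_poverty     <- -0.05
b_interaction <-  0.014

# Hypothetical officer levels at which to evaluate the poverty slope.
officers <- c(0, 1, 2, 3, 4)

# Conditional (marginal) effect of poverty at each officer level.
poverty_slope <- b_poverty + b_interaction * officers
```

A slope that moves toward zero as officers increase is what the focal hypothesis predicts. The symmetry of the product term is not a problem for this reading, because the main effects of poverty and officers still distinguish (3, 8) from (8, 3) in the fitted values.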
[R] UNIX installation of package systemfit fails
Dear list

I am trying to install the systemfit package under Unix with install.packages("systemfit"), and the installation fails. I am attaching the error output and version information below (with dependencies=TRUE there are many more errors). Any help is appreciated.

best,
yong

=
install.packages("systemfit")
Warning in install.packages("systemfit") :
  argument 'lib' is missing: using '/usr/home/d/068/meta/R/x86_64-unknown-linux-gnu-library/2.7'
--- Please select a CRAN mirror for use in this session ---
CRAN mirror
 1: Argentina             2: Australia             3: Austria
 4: Belgium               5: Brazil (PR)           6: Brazil (RJ)
 7: Brazil (SP 1)         8: Brazil (SP 2)         9: Canada (BC)
10: Canada (ON)          11: Chile                12: China
13: Croatia              14: Denmark              15: France (Toulouse)
16: France (Lyon)        17: France (Paris)       18: Germany (Goettingen)
19: Germany (Muenchen)   20: Iran                 21: Ireland
22: Italy (Milano)       23: Italy (Padua)        24: Italy (Palermo)
25: Japan (Aizu)         26: Japan (Tokyo)        27: Japan (Tsukuba)
28: Korea                29: Mexico               30: Netherlands (Amsterdam 2)
31: Netherlands (Utrecht) 32: New Zealand         33: Norway
34: Poland (Oswiecim)    35: Poland (Wroclaw)     36: Portugal
37: Russia               38: Singapore 1          39: Singapore 2
40: Slovenia (Ljubljana) 41: South Africa         42: Spain (Madrid)
43: Sweden               44: Switzerland          45: Taiwan (Taichung)
46: Taiwan (Taipeh)      47: Thailand             48: Turkey
49: UK (Bristol)         50: USA (CA 1)           51: USA (CA 2)
52: USA (IA)             53: USA (MI)             54: USA (MO)
55: USA (NC)             56: USA (OH)             57: USA (PA 1)
58: USA (PA 2)           59: USA (TX 1)           60: USA (TX 2)
61: USA (WA)
Selection: 57
also installing the dependencies 'zoo', 'Matrix', 'car', 'lmtest'

trying URL 'http://lib.stat.cmu.edu/R/CRAN/src/contrib/zoo_1.5-4.tar.gz'
Content type 'application/x-gzip' length 609057 bytes (594 Kb)
opened URL
==
downloaded 594 Kb

trying URL 'http://lib.stat.cmu.edu/R/CRAN/src/contrib/Matrix_0.999375-20.tar.gz'
Content type 'application/x-gzip' length 1954872 bytes (1.9 Mb)
opened URL
==
downloaded 1.9 Mb

trying URL 'http://lib.stat.cmu.edu/R/CRAN/src/contrib/car_1.2-12.tar.gz'
Content type 'application/x-gzip' length 220728 bytes (215 Kb)
opened URL
==
downloaded 215 Kb

trying URL 'http://lib.stat.cmu.edu/R/CRAN/src/contrib/lmtest_0.9-22.tar.gz'
Content type 'application/x-gzip' length 191099 bytes (186 Kb)
opened URL
==
downloaded 186 Kb

trying URL 'http://lib.stat.cmu.edu/R/CRAN/src/contrib/systemfit_1.0-8.tar.gz'
Content type 'application/x-gzip' length 727116 bytes (710 Kb)
opened URL
==
downloaded 710 Kb

ERROR: failed to lock directory '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7' for modifying
Try removing '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7/00LOCK'
ERROR: failed to lock directory '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7' for modifying
Try removing '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7/00LOCK'
ERROR: failed to lock directory '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7' for modifying
Try removing '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7/00LOCK'
ERROR: failed to lock directory '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7' for modifying
Try removing '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7/00LOCK'
ERROR: failed to lock directory '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7' for modifying
Try removing '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7/00LOCK'

The downloaded packages are in
  /tmp/RtmpJ28hPv/downloaded_packages
Warning messages:
1: In install.packages("systemfit") : installation of package 'zoo' had non-zero exit status
2: In install.packages("systemfit") : installation of package 'Matrix' had non-zero exit status
3: In install.packages("systemfit") : installation of package 'car' had non-zero exit status
4: In install.packages("systemfit") : installation of package 'lmtest' had non-zero exit status
5: In install.packages("systemfit") : installation of package 'systemfit' had non-zero exit status

version
         _
platform x86_64-unknown-linux-gnu
arch     x86_64
os       linux-gnu
system   x86_64, linux-gnu
status
major    2
minor    7.0
year     2008
month    04
day      22
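The log itself suggests removing the 00LOCK directory, which is typically left behind when an earlier installation was interrupted. A hedged sketch of doing that from within R follows; it uses a temporary directory as a hypothetical stand-in for the personal library, and it should only be applied to the real library when no other installation is running.

```r
# Hypothetical stand-in for the personal library, with a stale lock in it.
lib_dir  <- tempfile("library")
lock_dir <- file.path(lib_dir, "00LOCK")
dir.create(lock_dir, recursive = TRUE)

# Remove the stale lock directory if it is present.
if (file.exists(lock_dir)) {
  unlink(lock_dir, recursive = TRUE)
}
lock_removed <- !file.exists(lock_dir)

# Afterwards the installation can be retried, naming the library explicitly,
# e.g. install.packages("systemfit", lib = lib_dir)   # not run here
```

For the real case, lib_dir would be the library path printed in the error message.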
[R] problem with unix package installation, following code gives no response: install.packages(packagename, dependencies=TRUE)
Dear list

I am trying to install a package under Unix. The command below works in some cases but not in others. The main symptom is that R stops with the following message (say I am trying to install the package SASxport):

###
Warning in install.packages("SASxport", dependencies = TRUE) :
  argument 'lib' is missing: using '/usr/home/d/068/wangyong/R/x86_64-unknown-linux-gnu-library/2.7'
Warning: unable to access index for repository http://lib.stat.cmu.edu/R/CRAN/src/contrib
#

My internet connection has no problem, so why does this happen? Or can you suggest some more options or examples to follow for package installation under Unix?

Thank you
will
[R] help on package or code for simultaneous equation probit (logit) model
Dear List

I am trying to fit a simultaneous equation logit model, i.e., the response variables of the structural equations are binomial. I am not sure whether systemfit can do this job. A Google search does not yield much helpful information. Your knowledge of any other packages or code is appreciated.

Thanks
will
[R] lm error and how to sidestep an error occurring in a for loop to keep it going without interruption
Dear R list

I am running a for loop on a large dataset for exploratory investigation. The code inside the loop includes the lm routine. Unfortunately, for some specifications of the dependent variable, the loop is interrupted by the error below:

Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") : contrasts can be applied only to factors with 2 or more levels

I suspected this might be caused by missing values which, once removed, leave some factors with values on only one level. It turns out this is not true. Answers to the following two questions are appreciated.

1. What might be the reason behind the error message?
2. If I simply want to circumvent this error and keep the for loop going, how should I do that?

Regards
young
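On the second question, a common pattern is to wrap the lm() call in tryCatch() so that a failing fit is recorded and skipped instead of stopping the loop. The sketch below uses simulated data (labeled as such) in which one subset deliberately contains a factor with a single level, which reproduces this error message; whether that is the cause in the poster's data is not known.

```r
# Simulated data: in group "b" the factor f takes only the value "u".
set.seed(1)
dat <- data.frame(
  y = rnorm(12),
  g = rep(c("a", "b", "c"), each = 4),
  f = factor(c(rep(c("u", "v"), 2), rep("u", 4), rep(c("u", "v"), 2)))
)

fits <- vector("list", 3)
names(fits) <- c("a", "b", "c")

for (grp in names(fits)) {
  piece <- droplevels(dat[dat$g == grp, ])   # unused levels dropped, as after subsetting
  fit <- tryCatch(
    lm(y ~ f, data = piece),
    error = function(e) {
      message("skipping group ", grp, ": ", conditionMessage(e))
      NULL
    }
  )
  if (!is.null(fit)) fits[[grp]] <- fit      # failed groups stay NULL
}
```

Checking sapply(piece, function(col) length(unique(na.omit(col)))) just before the failing call is one way to find which variable has collapsed to a single level.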