On Dec 6, 2011, at 22:33, Gene Leynes wrote:

> Mark,
>
> Thanks for your suggestions.
>
> That's a good idea about the NULL columns; I didn't think of that.
> Surprisingly, it didn't have any effect on the time.

Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
fix both?

>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> rep(NULL,3696)).

As a general matter, if you want people to dig into this, they need some
paraphrase of the file to play with. Would it be possible to set up a small
R program that generates a data file which displays the issue? Everything I
try seems to take about a second to read in.

-pd

>
> This problem was just a curiosity, I already did the import using Excel and
> VBA. I was just going to illustrate the power and simplicity of R, but
> ironically it's been much slower and harder in R...
> The VBA was painful and messy, and took me over an hour to write; but at
> least it worked quickly and reliably.
> The R code was clean and only took me about 5 minutes to write, but the run
> time was prohibitively slow!
>
> I profiled the code, but that offers little insight to me.
>
> Profile results with 10 line file:
>
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>              self.time self.pct total.time total.pct
> scan             12.24    53.50      12.24     53.50
> read.table       10.58    46.24      22.88    100.00
> type.convert      0.04     0.17       0.04      0.17
> make.names        0.02     0.09       0.02      0.09
>
> $by.total
>              total.time total.pct self.time self.pct
> read.table        22.88    100.00     10.58    46.24
> scan              12.24     53.50     12.24    53.50
> type.convert       0.04      0.17      0.04     0.17
> make.names         0.02      0.09      0.02     0.09
>
> $sample.interval
> [1] 0.02
>
> $sampling.time
> [1] 22.88
>
>
> Profile results with 250 line file:
>
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>              self.time self.pct total.time total.pct
> scan             23.88    68.15      23.88     68.15
> read.table       10.78    30.76      35.04    100.00
> type.convert      0.30     0.86       0.32      0.91
> character         0.02     0.06       0.02      0.06
> file              0.02     0.06       0.02      0.06
> lapply            0.02     0.06       0.02      0.06
> unlist            0.02     0.06       0.02      0.06
>
> $by.total
>                total.time total.pct self.time self.pct
> read.table          35.04    100.00     10.78    30.76
> scan                23.88     68.15     23.88    68.15
> type.convert         0.32      0.91      0.30     0.86
> sapply               0.04      0.11      0.00     0.00
> character            0.02      0.06      0.02     0.06
> file                 0.02      0.06      0.02     0.06
> lapply               0.02      0.06      0.02     0.06
> unlist               0.02      0.06      0.02     0.06
> simplify2array       0.02      0.06      0.00     0.00
>
> $sample.interval
> [1] 0.02
>
> $sampling.time
> [1] 35.04
>
>
> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <marklee...@gmail.com> wrote:
>
>> hi gene: maybe someone else will reply with some subtleties that I'm not
>> aware of. one other thing that might help: if you know which columns you
>> want, you can set the others to NULL through colClasses and this should
>> speed things up also. For example, say you knew you only wanted the
>> first four columns and they were character. then you could do,
>>
>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> rep(NULL,3696)).
>>
>> hopefully someone else will say something that does the trick. it seems
>> odd to me as far as the difference in timings ? good luck.
>>
>>
>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gley...@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> Thank you for the reply
>>>
>>> I neglected to mention that I had already set
>>> options(stringsAsFactors=FALSE)
>>>
>>> I agree, skipping the factor determination can help performance.
>>>
>>> The main reason that I wanted to use read.table is because it will
>>> correctly determine the column classes for me. I don't really want to
>>> specify 3700 column classes! (I'm not sure what they are anyway).
>>>
>>>
>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <marklee...@gmail.com> wrote:
>>>
>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>>>> If you know what your variables are ahead of time and what you want them to
>>>> be, this allows you to be specific by specifying, character or numeric,
>>>> etc and often it makes things faster. others will have more to say.
>>>>
>>>> also, if most of your variables are characters, R will try to
>>>> convert them into factors by default. If you use as.is = TRUE it won't
>>>> do this and that might speed things up also.
>>>>
>>>>
>>>> Rejoinder: above tidbits are just from experience. I don't know if
>>>> it's in stone or a hard and fast rule.
>>>>
>>>>
>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gley...@gmail.com> wrote:
>>>>
>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>> I'm sorry, but can't send out the file I'm using, so there is no
>>>>> reproducible example.
>>>>>
>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>>>>> file.
>>>>> The strange thing is that it takes roughly the same amount of time if
>>>>> the file is 100 times larger.
>>>>>
>>>>> After re-reviewing the data Import / Export manual I think the best
>>>>> approach would be to use Python, or perhaps the readLines function, but
>>>>> I was hoping to understand why the simple read.table approach wasn't
>>>>> working as expected.
>>>>>
>>>>> Some relevant facts:
>>>>>
>>>>>    1. There are about 3700 columns. Maybe this is the problem? Still
>>>>>       the file size is not very large.
>>>>>    2. The file encoding is ANSI, but I'm not specifying that in the
>>>>>       function. Setting fileEncoding="ANSI" produces an "unsupported
>>>>>       conversion" error
>>>>>    3. readLines imports the lines quickly
>>>>>    4. scan imports the file quickly also
>>>>>
>>>>>
>>>>> Obviously, scan and readLines would require more coding to identify
>>>>> columns, etc.
>>>>>
>>>>> my code:
>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>> header=TRUE))
>>>>>
>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>>>>>
>>>>> Thanks
>>>>>
>>>>> Gene
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com
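For readers following the thread, here is a minimal sketch of the kind of self-contained example requested above: a small R program that writes a wide tab-separated file and times read.table on it, once with guessed column classes and once with quoted "character" / "NULL" entries in colClasses. The file contents are assumptions (the original file was never shared): 10 rows, 3700 columns, the first four character and the rest numeric. The column names V1, V2, ... are likewise made up. Whether this reproduces the reported slowdown will depend on the R version and on what the real file contains.

```r
# Hypothetical stand-in for the file that could not be shared.
set.seed(1)
n_cols <- 3700   # column count reported in the thread
n_rows <- 10     # row count of the smaller profiled file

# Assumption: first four columns are character, the rest numeric.
chr <- matrix(paste0("id", sample.int(999, n_rows * 4, replace = TRUE)),
              nrow = n_rows)
num <- matrix(round(rnorm(n_rows * (n_cols - 4)), 3), nrow = n_rows)
dat <- data.frame(chr, num, stringsAsFactors = FALSE)
names(dat) <- paste0("V", seq_len(n_cols))

# Write a tab-separated file with a header row to a temporary location.
tmp <- tempfile(fileext = ".txt")
write.table(dat, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Baseline: let read.table determine every column class itself.
t_guess <- system.time(
  d1 <- read.table(tmp, sep = "\t", header = TRUE, as.is = TRUE)
)

# Keep only the first four columns. The class names are quoted strings;
# the string "NULL" tells read.table to skip that column.
cc <- c(rep("character", 4), rep("NULL", n_cols - 4))
t_skip <- system.time(
  d2 <- read.table(tmp, sep = "\t", header = TRUE, colClasses = cc)
)

# Elapsed seconds for each call, then the dimensions of each result.
print(c(guess = t_guess[["elapsed"]], skip = t_skip[["elapsed"]]))
print(dim(d1))
print(dim(d2))

unlink(tmp)
```

Posting a script like this alongside the timings lets others run exactly the same input, which is what the posting guide asks for.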