R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file verbatim:

system.time(read.table("test2.txt"))
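
In case it helps anyone reproduce the timing without the original file, here is a rough sketch that writes a synthetic tab-separated file of similar shape (10 rows, about 3700 columns of mixed numeric, character and logical values) and times read.table on it. The dimensions, column names and values are made up for illustration and are not taken from test2.txt, so the timing may differ from the figures reported in this thread.

```r
# Hypothetical stand-in for the real data: 10 rows x 3700 columns,
# cycling through character, logical and numeric columns.
set.seed(1)
n_rows <- 10
n_cols <- 3700

make_col <- function(j, n) {
  switch(j %% 3 + 1,
         round(rnorm(n), 3),                                  # numeric
         sample(c("alpha", "beta", "gamma"), n, replace = TRUE),  # character
         sample(c(TRUE, FALSE), n, replace = TRUE))           # logical
}

cols <- lapply(seq_len(n_cols), make_col, n = n_rows)
names(cols) <- paste("V", seq_len(n_cols), sep = "")
dat_out <- as.data.frame(cols, stringsAsFactors = FALSE)

# Write the file with a header row and tab separators.
tmp <- tempfile(fileext = ".txt")
write.table(dat_out, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Time the same kind of call that was slow in the original report.
timing <- system.time(
  dat_in <- read.table(tmp, sep = "\t", header = TRUE,
                       stringsAsFactors = FALSE)
)
print(timing)
print(dim(dat_in))
```

If a file generated this way reads quickly on the affected machine, the slowdown is more likely tied to something specific in the real file or the setup than to the column count alone.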
Michael

2011/12/7 Gene Leynes <gley...@gmail.com>:
> Peter,
>
> You're quite right; it's nearly impossible to make progress without a
> working example.
>
> I created an ** extremely simplified ** example for distribution. The real
> data has numeric, character, and boolean classes.
>
> The file still takes 25.08 seconds to read, despite its small size.
>
> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
> machine (not that it should particularly matter with this type of data /
> functions).
>
> ## The code:
> options(stringsAsFactors=FALSE)
> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
> str(dat, 0)
>
> Thanks again!
>
> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pda...@gmail.com> wrote:
>
>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
>>
>> > Mark,
>> >
>> > Thanks for your suggestions.
>> >
>> > That's a good idea about the NULL columns; I didn't think of that.
>> > Surprisingly, it didn't have any effect on the time.
>>
>> Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
>> fix both?
>>
>> >> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> >> rep(NULL,3696)).
>>
>> As a general matter, if you want people to dig into this, they need some
>> paraphrase of the file to play with. Would it be possible to set up a small
>> R program that generates a data file which displays the issue? Everything I
>> try seems to take about a second to read in.
>>
>> -pd
>>
>> > This problem was just a curiosity; I already did the import using Excel
>> > and VBA. I was just going to illustrate the power and simplicity of R,
>> > but ironically it's been much slower and harder in R...
>> > The VBA was painful and messy, and took me over an hour to write; but at
>> > least it worked quickly and reliably.
>> > The R code was clean and only took me about 5 minutes to write, but the
>> > run time was prohibitively slow!
>> >
>> > I profiled the code, but that offers little insight to me.
>> >
>> > Profile results with 10 line file:
>> >
>> > > summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>> > $by.self
>> >              self.time self.pct total.time total.pct
>> > scan             12.24    53.50      12.24     53.50
>> > read.table       10.58    46.24      22.88    100.00
>> > type.convert      0.04     0.17       0.04      0.17
>> > make.names        0.02     0.09       0.02      0.09
>> >
>> > $by.total
>> >              total.time total.pct self.time self.pct
>> > read.table        22.88    100.00     10.58    46.24
>> > scan              12.24     53.50     12.24    53.50
>> > type.convert       0.04      0.17      0.04     0.17
>> > make.names         0.02      0.09      0.02     0.09
>> >
>> > $sample.interval
>> > [1] 0.02
>> >
>> > $sampling.time
>> > [1] 22.88
>> >
>> >
>> > Profile results with 250 line file:
>> >
>> > > summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>> > $by.self
>> >              self.time self.pct total.time total.pct
>> > scan             23.88    68.15      23.88     68.15
>> > read.table       10.78    30.76      35.04    100.00
>> > type.convert      0.30     0.86       0.32      0.91
>> > character         0.02     0.06       0.02      0.06
>> > file              0.02     0.06       0.02      0.06
>> > lapply            0.02     0.06       0.02      0.06
>> > unlist            0.02     0.06       0.02      0.06
>> >
>> > $by.total
>> >                total.time total.pct self.time self.pct
>> > read.table          35.04    100.00     10.78    30.76
>> > scan                23.88     68.15     23.88    68.15
>> > type.convert         0.32      0.91      0.30     0.86
>> > sapply               0.04      0.11      0.00     0.00
>> > character            0.02      0.06      0.02     0.06
>> > file                 0.02      0.06      0.02     0.06
>> > lapply               0.02      0.06      0.02     0.06
>> > unlist               0.02      0.06      0.02     0.06
>> > simplify2array       0.02      0.06      0.00     0.00
>> >
>> > $sample.interval
>> > [1] 0.02
>> >
>> > $sampling.time
>> > [1] 35.04
>> >
>> >
>> > On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <marklee...@gmail.com> wrote:
>> >
>> >> hi gene: maybe someone else will reply with some subtleties that I'm not
>> >> aware of. one other thing that might help: if you know which columns you
>> >> want, you can set the others to NULL through colClasses and this should
>> >> speed things up also.
>> >> For example, say you knew you only wanted the first four columns and
>> >> they were character. then you could do,
>> >>
>> >> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> >> rep(NULL,3696)).
>> >>
>> >> hopefully someone else will say something that does the trick. it seems
>> >> odd to me as far as the difference in timings ? good luck.
>> >>
>> >>
>> >> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gley...@gmail.com> wrote:
>> >>
>> >>> Mark,
>> >>>
>> >>> Thank you for the reply
>> >>>
>> >>> I neglected to mention that I had already set
>> >>> options(stringsAsFactors=FALSE)
>> >>>
>> >>> I agree, skipping the factor determination can help performance.
>> >>>
>> >>> The main reason that I wanted to use read.table is because it will
>> >>> correctly determine the column classes for me. I don't really want to
>> >>> specify 3700 column classes! (I'm not sure what they are anyway).
>> >>>
>> >>>
>> >>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <marklee...@gmail.com> wrote:
>> >>>
>> >>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>> >>>> If you know what your variables are ahead of time and what you want
>> >>>> them to be, this allows you to be specific by specifying, character or
>> >>>> numeric, etc and often it makes things faster. others will have more
>> >>>> to say.
>> >>>>
>> >>>> also, if most of your variables are characters, R will try to convert
>> >>>> them into factors by default. If you use as.is = TRUE it won't do this
>> >>>> and that might speed things up also.
>> >>>>
>> >>>>
>> >>>> Rejoinder: above tidbits are just from experience. I don't know if
>> >>>> it's in stone or a hard and fast rule.
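
On the colClasses suggestion quoted above: as Peter notes, the class names need to be quoted strings. A small sketch of what that looks like, using a made-up six-column file rather than the real 3700-column one (the file contents and column names here are purely illustrative):

```r
# Hypothetical file: 6 tab-separated columns, of which only the first 2 are wanted.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc\td\te\tf",
             "x\ty\t1\t2\t3\t4",
             "p\tq\t5\t6\t7\t8"), tmp)

# colClasses takes class names as strings; the string "NULL" skips a column.
keep <- read.table(tmp, sep = "\t", header = TRUE, as.is = TRUE,
                   colClasses = c(rep("character", 2), rep("NULL", 4)))
str(keep)

# Unquoted NULL cannot be replicated: rep(NULL, 4) is just NULL,
# so it contributes nothing to the colClasses vector.
print(is.null(rep(NULL, 4)))
```

With the quoted form, the skipped columns are never converted or stored, which is where any saving would come from.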
>> >>>>
>> >>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gley...@gmail.com> wrote:
>> >>>>
>> >>>>> ** Disclaimer: I'm looking for general suggestions **
>> >>>>> I'm sorry, but can't send out the file I'm using, so there is no
>> >>>>> reproducible example.
>> >>>>>
>> >>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>> >>>>> file. The strange thing is that it takes roughly the same amount of
>> >>>>> time if the file is 100 times larger.
>> >>>>>
>> >>>>> After re-reviewing the data Import / Export manual I think the best
>> >>>>> approach would be to use Python, or perhaps the readLines function,
>> >>>>> but I was hoping to understand why the simple read.table approach
>> >>>>> wasn't working as expected.
>> >>>>>
>> >>>>> Some relevant facts:
>> >>>>>
>> >>>>>   1. There are about 3700 columns. Maybe this is the problem? Still
>> >>>>>      the file size is not very large.
>> >>>>>   2. The file encoding is ANSI, but I'm not specifying that in the
>> >>>>>      function. Setting fileEncoding="ANSI" produces an "unsupported
>> >>>>>      conversion" error
>> >>>>>   3. readLines imports the lines quickly
>> >>>>>   4. scan imports the file quickly also
>> >>>>>
>> >>>>> Obviously, scan and readLines would require more coding to identify
>> >>>>> columns, etc.
>> >>>>>
>> >>>>> my code:
>> >>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>> >>>>> header=TRUE))
>> >>>>>
>> >>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>> >>>>>
>> >>>>> Thanks
>> >>>>>
>> >>>>> Gene
>> >>>>>
>> >>>>>        [[alternative HTML version deleted]]
>> >>>>>
>> >>>>> ______________________________________________
>> >>>>> R-help@r-project.org mailing list
>> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>> PLEASE do read the posting guide
>> >>>>> http://www.R-project.org/posting-guide.html
>> >>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Peter Dalgaard, Professor,
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Email: pd....@cbs.dk  Priv: pda...@gmail.com