Hi

> system.time(dat<-read.table("test2.txt"))
   user  system elapsed
  32.38    0.00   32.40
> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
   user  system elapsed
  32.30    0.03   32.36

Couldn't it be a Windows issue?

               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status         Under development (unstable)
major          2
minor          14.0
year           2011
month          04
day            27
svn rev        55657
language       R
version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)
>
> dim(dat)
[1]    7 3765
>

But from the dat file it seems to me that its structure is somehow weird.

> head(names(dat))
[1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron"
[6] "Carbon"
> tail(names(dat))
[1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
[6] "Scandium.32"
>

There is a row of names with repeating values. Maybe most of the time is
spent checking the validity of the names.

Regards
Petr

r-help-boun...@r-project.org wrote on 07.12.2011 23:11:10:

> peter dalgaard <pda...@gmail.com>
> Sent by: r-help-boun...@r-project.org
>
> 07.12.2011 23:11
>
> To
>
> "R. Michael Weylandt" <michael.weyla...@gmail.com>
>
> Cc
>
> r-help@r-project.org, Gene Leynes <gley...@gmail.com>
>
> Subject
>
> Re: [R] read.table performance
>
>
> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
>
> > R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
> > verbatim: system.time(read.table("test2.txt"))
>
> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
>
> Gene, are you by any chance storing the file in a heavily virus-scanned
> system directory?
>
> -pd
>
> > Michael
> >
> > 2011/12/7 Gene Leynes <gley...@gmail.com>:
> >> Peter,
> >>
> >> You're quite right; it's nearly impossible to make progress without a
> >> working example.
> >>
> >> I created an ** extremely simplified ** example for distribution. The real
> >> data has numeric, character, and boolean classes.
> >>
> >> The file still takes 25.08 seconds to read, despite its small size.
> >>
> >> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
> >> machine (not that it should particularly matter with this type of data /
> >> functions).
> >>
> >> ## The code:
> >> options(stringsAsFactors=FALSE)
> >> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
> >> str(dat, 0)
> >>
> >>
> >> Thanks again!
> >>
> >>
> >>
> >> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pda...@gmail.com> wrote:
> >>
> >>>
> >>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
> >>>
> >>>> Mark,
> >>>>
> >>>> Thanks for your suggestions.
> >>>>
> >>>> That's a good idea about the NULL columns; I didn't think of that.
> >>>> Surprisingly, it didn't have any effect on the time.
> >>>
> >>> Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
> >>> fix both?
> >>>
> >>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
> >>>>> rep(NULL,3696)).
> >>>
> >>> As a general matter, if you want people to dig into this, they need some
> >>> paraphrase of the file to play with. Would it be possible to set up a small
> >>> R program that generates a data file which displays the issue? Everything I
> >>> try seems to take about a second to read in.
> >>>
> >>> -pd
> >>>
> >>>>
> >>>> This problem was just a curiosity; I already did the import using Excel
> >>>> and VBA. I was just going to illustrate the power and simplicity of R,
> >>>> but ironically it's been much slower and harder in R...
> >>>> The VBA was painful and messy, and took me over an hour to write; but at
> >>>> least it worked quickly and reliably.
> >>>> The R code was clean and only took me about 5 minutes to write, but the
> >>>> run time was prohibitively slow!
> >>>>
> >>>> I profiled the code, but that offers little insight to me.
> >>>>
> >>>> Profile results with 10 line file:
> >>>>
> >>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>> $by.self
> >>>>                self.time self.pct total.time total.pct
> >>>> scan               12.24    53.50      12.24     53.50
> >>>> read.table         10.58    46.24      22.88    100.00
> >>>> type.convert        0.04     0.17       0.04      0.17
> >>>> make.names          0.02     0.09       0.02      0.09
> >>>>
> >>>> $by.total
> >>>>                total.time total.pct self.time self.pct
> >>>> read.table          22.88    100.00     10.58    46.24
> >>>> scan                12.24     53.50     12.24    53.50
> >>>> type.convert         0.04      0.17      0.04     0.17
> >>>> make.names           0.02      0.09      0.02     0.09
> >>>>
> >>>> $sample.interval
> >>>> [1] 0.02
> >>>>
> >>>> $sampling.time
> >>>> [1] 22.88
> >>>>
> >>>>
> >>>> Profile results with 250 line file:
> >>>>
> >>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>> $by.self
> >>>>                self.time self.pct total.time total.pct
> >>>> scan               23.88    68.15      23.88     68.15
> >>>> read.table         10.78    30.76      35.04    100.00
> >>>> type.convert        0.30     0.86       0.32      0.91
> >>>> character           0.02     0.06       0.02      0.06
> >>>> file                0.02     0.06       0.02      0.06
> >>>> lapply              0.02     0.06       0.02      0.06
> >>>> unlist              0.02     0.06       0.02      0.06
> >>>>
> >>>> $by.total
> >>>>                total.time total.pct self.time self.pct
> >>>> read.table          35.04    100.00     10.78    30.76
> >>>> scan                23.88     68.15     23.88    68.15
> >>>> type.convert         0.32      0.91      0.30     0.86
> >>>> sapply               0.04      0.11      0.00     0.00
> >>>> character            0.02      0.06      0.02     0.06
> >>>> file                 0.02      0.06      0.02     0.06
> >>>> lapply               0.02      0.06      0.02     0.06
> >>>> unlist               0.02      0.06      0.02     0.06
> >>>> simplify2array       0.02      0.06      0.00     0.00
> >>>>
> >>>> $sample.interval
> >>>> [1] 0.02
> >>>>
> >>>> $sampling.time
> >>>> [1] 35.04
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <marklee...@gmail.com> wrote:
> >>>>
> >>>>> hi gene: maybe someone else will reply with some subtleties that I'm
> >>>>> not aware of.
> >>>>> one other thing that might help: if you know which columns you want,
> >>>>> you can set the others to NULL through colClasses and this should
> >>>>> speed things up also. For example, say you knew you only wanted the
> >>>>> first four columns and they were character. then you could do,
> >>>>>
> >>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
> >>>>> rep(NULL,3696)).
> >>>>>
> >>>>> hopefully someone else will say something that does the trick. the
> >>>>> difference in timings seems odd to me. good luck.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gley...@gmail.com> wrote:
> >>>>>
> >>>>>> Mark,
> >>>>>>
> >>>>>> Thank you for the reply
> >>>>>>
> >>>>>> I neglected to mention that I had already set
> >>>>>> options(stringsAsFactors=FALSE)
> >>>>>>
> >>>>>> I agree, skipping the factor determination can help performance.
> >>>>>>
> >>>>>> The main reason that I wanted to use read.table is because it will
> >>>>>> correctly determine the column classes for me. I don't really want to
> >>>>>> specify 3700 column classes! (I'm not sure what they are anyway).
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <marklee...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
> >>>>>>> If you know what your variables are ahead of time and what you want
> >>>>>>> them to be, this allows you to be specific by specifying character or
> >>>>>>> numeric, etc., and often it makes things faster. others will have more
> >>>>>>> to say.
> >>>>>>>
> >>>>>>> also, if most of your variables are characters, R will try to
> >>>>>>> convert them into factors by default. If you use as.is = TRUE it
> >>>>>>> won't do this and that might speed things up also.
> >>>>>>>
> >>>>>>>
> >>>>>>> Rejoinder: above tidbits are just from experience.
> >>>>>>> I don't know if it's in stone or a hard and fast rule.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gley...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> ** Disclaimer: I'm looking for general suggestions **
> >>>>>>>> I'm sorry, but can't send out the file I'm using, so there is no
> >>>>>>>> reproducible example.
> >>>>>>>>
> >>>>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
> >>>>>>>> file.
> >>>>>>>> The strange thing is that it takes roughly the same amount of time if
> >>>>>>>> the file is 100 times larger.
> >>>>>>>>
> >>>>>>>> After re-reviewing the R Data Import/Export manual I think the best
> >>>>>>>> approach would be to use Python, or perhaps the readLines function,
> >>>>>>>> but I was hoping to understand why the simple read.table approach
> >>>>>>>> wasn't working as expected.
> >>>>>>>>
> >>>>>>>> Some relevant facts:
> >>>>>>>>
> >>>>>>>>    1. There are about 3700 columns. Maybe this is the problem? Still
> >>>>>>>>       the file size is not very large.
> >>>>>>>>    2. The file encoding is ANSI, but I'm not specifying that in the
> >>>>>>>>       function. Setting fileEncoding="ANSI" produces an "unsupported
> >>>>>>>>       conversion" error
> >>>>>>>>    3. readLines imports the lines quickly
> >>>>>>>>    4. scan imports the file quickly also
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Obviously, scan and readLines would require more coding to identify
> >>>>>>>> columns, etc.
> >>>>>>>>
> >>>>>>>> my code:
> >>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
> >>>>>>>> header=TRUE))
> >>>>>>>>
> >>>>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Gene
> >>>>>>>>
> >>>>>>>> ______________________________________________
> >>>>>>>> R-help@r-project.org mailing list
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>> PLEASE do read the posting guide
> >>>>>>>> http://www.R-project.org/posting-guide.html
> >>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>> --
> >>> Peter Dalgaard, Professor,
> >>> Center for Statistics, Copenhagen Business School
> >>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> >>> Phone: (+45)38153501
> >>> Email: pd....@cbs.dk Priv: pda...@gmail.com
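
Peter's request in the thread for a small R program that generates a data file displaying the issue could be met along the following lines. This is only a sketch under assumptions: the dimensions, the column names and the variable names (n_col, n_row, path, classes) are invented for illustration and are not taken from Gene's data. It also shows the colClasses call with "character" and "NULL" quoted, as Peter suggested.

```r
# Hypothetical stand-in for the real file: few rows, very many columns.
set.seed(1)
n_col <- 3700
n_row <- 7
col_names <- paste("V", seq_len(n_col), sep = "")
mat <- matrix(round(runif(n_row * n_col), 3), nrow = n_row,
              dimnames = list(NULL, col_names))
path <- tempfile(fileext = ".txt")
write.table(mat, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Baseline: read.table guesses the class of every column.
t_guess <- system.time(
  dat_guess <- read.table(path, sep = "\t", header = TRUE)
)

# Keep the first four columns as character and skip all the others.
# Note that "character" and "NULL" are quoted strings here.
classes <- c(rep("character", 4), rep("NULL", n_col - 4))
t_skip <- system.time(
  dat_skip <- read.table(path, sep = "\t", header = TRUE,
                         colClasses = classes)
)

# Elapsed seconds for both variants; the values depend on the machine.
print(c(guess = t_guess[["elapsed"]], skip = t_skip[["elapsed"]]))
str(dat_skip)
```

If a generated file like this reads in about a second, the slowness is more likely tied to the machine or to something specific in the real file than to the number of columns alone.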
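
Petr's guess that the time goes into checking the header names can be tested by reading the same file once more with check.names = FALSE. The sketch below is an assumption-laden illustration: the header, the sizes and the variable names are made up, loosely modelled on the names(dat) output in the thread. The profiles quoted in the thread attribute only 0.02 s to make.names, so this may well turn out not to be the cause.

```r
# Hypothetical header: six element names repeated many times.
base_names <- c("Hydrogen", "Helium", "Lithium", "Beryllium", "Boron", "Carbon")
header <- rep(base_names, times = 600)   # 3600 column names with repeats
body <- rep(paste(rep("1", length(header)), collapse = "\t"), 7)
path <- tempfile(fileext = ".txt")
writeLines(c(paste(header, collapse = "\t"), body), path)

# Default behaviour: names are made syntactically valid and unique.
t_checked <- system.time(
  dat_checked <- read.table(path, sep = "\t", header = TRUE)
)

# Same read with name checking switched off; duplicate names are kept.
t_raw <- system.time(
  dat_raw <- read.table(path, sep = "\t", header = TRUE, check.names = FALSE)
)

# Elapsed seconds for both variants; the values depend on the machine.
print(c(checked = t_checked[["elapsed"]], raw = t_raw[["elapsed"]]))
print(head(names(dat_checked), 8))
print(head(names(dat_raw), 8))
```

If the two timings are close, name checking can be ruled out and attention can go back to scan and read.table themselves, which is where the profiles point.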