-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/12/11 09:32, Petr PIKAL wrote:
> Hi
>
> > system.time(dat<-read.table("test2.txt"))
>    user  system elapsed
>   32.38    0.00   32.40
>
> > system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
>    user  system elapsed
>   32.30    0.03   32.36
>
> Couldn't it be a Windows issue?

Likely - here on Linux I get:

> system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t', header=TRUE))
   user  system elapsed
  1.560   0.000   1.579
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status
major          2
minor          14.0
year           2011
month          10
day            31
svn rev        57496
language       R
version.string R version 2.14.0 (2011-10-31)
>

Cheers,

Rainer

>                _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status         Under development (unstable)
> major          2
> minor          14.0
> year           2011
> month          04
> day            27
> svn rev        55657
> language       R
> version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)
> >
>
> > dim(dat)
> [1]    7 3765
> >
>
> But from the dat file it seems to me that its structure is somehow
> weird.
>
> > head(names(dat))
> [1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron"
> [6] "Carbon"
> > tail(names(dat))
> [1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
> [6] "Scandium.32"
> >
>
> There is a row of names with repeating values. Maybe most of the
> time is spent checking the validity of the names.
>
> Regards
> Petr
>
> r-help-boun...@r-project.org wrote on 07.12.2011 23:11:10:
>
>> From: peter dalgaard <pda...@gmail.com>
>> Sent by: r-help-boun...@r-project.org
>> Date: 07.12.2011 23:11
>> To: "R. Michael Weylandt" <michael.weyla...@gmail.com>
>> Cc: r-help@r-project.org, Gene Leynes <gley...@gmail.com>
>> Subject: Re: [R] read.table performance
>>
>>
>> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
>>
>>> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
>>> verbatim: system.time(read.table("test2.txt"))
>>
>> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
>>
>> Gene, are you by any chance storing the file in a heavily
>> virus-scanned system directory?
>>
>> -pd
>>
>>> Michael
>>>
>>> 2011/12/7 Gene Leynes <gley...@gmail.com>:
>>>> Peter,
>>>>
>>>> You're quite right; it's nearly impossible to make progress
>>>> without a working example.
>>>>
>>>> I created an ** extremely simplified ** example for
>>>> distribution. The real data has numeric, character, and
>>>> boolean classes.
>>>>
>>>> The file still takes 25.08 seconds to read, despite its
>>>> small size.
>>>>
>>>> I neglected to mention that I'm using R 2.13.0 and I'm on a
>>>> Windows 7 machine (not that it should particularly matter
>>>> with this type of data / functions).
>>>>
>>>> ## The code:
>>>> options(stringsAsFactors=FALSE)
>>>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
>>>> str(dat, 0)
>>>>
>>>>
>>>> Thanks again!
>>>>
>>>>
>>>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pda...@gmail.com> wrote:
>>>>
>>>>>
>>>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> Thanks for your suggestions.
>>>>>>
>>>>>> That's a good idea about the NULL columns; I didn't think
>>>>>> of that. Surprisingly, it didn't have any effect on the
>>>>>> time.
>>>>>
>>>>> Hmm, I think you want "character" and "NULL" there (i.e.,
>>>>> quoted). Did you fix both?
>>>>>
>>>>>>> read.table(whatever, as.is=TRUE, colClasses =
>>>>>>> c(rep(character,4), rep(NULL,3696)).
>>>>>
>>>>> As a general matter, if you want people to dig into this,
>>>>> they need some paraphrase of the file to play with. Would it
>>>>> be possible to set up a small R program that generates a
>>>>> data file which displays the issue?
>>>>> Everything I try seems to take about a second to read in.
>>>>>
>>>>> -pd
>>>>>
>>>>>>
>>>>>> This problem was just a curiosity; I already did the
>>>>>> import using Excel and VBA. I was just going to illustrate
>>>>>> the power and simplicity of R, but ironically it's been
>>>>>> much slower and harder in R...
>>>>>> The VBA was painful and messy, and took me over an hour to
>>>>>> write; but at least it worked quickly and reliably.
>>>>>> The R code was clean and only took me about 5 minutes to
>>>>>> write, but the run time was prohibitively slow!
>>>>>>
>>>>>> I profiled the code, but that offers little insight to
>>>>>> me.
>>>>>>
>>>>>> Profile results with 10 line file:
>>>>>>
>>>>>> > summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>>> $by.self
>>>>>>              self.time self.pct total.time total.pct
>>>>>> scan             12.24    53.50      12.24     53.50
>>>>>> read.table       10.58    46.24      22.88    100.00
>>>>>> type.convert      0.04     0.17       0.04      0.17
>>>>>> make.names        0.02     0.09       0.02      0.09
>>>>>>
>>>>>> $by.total
>>>>>>              total.time total.pct self.time self.pct
>>>>>> read.table        22.88    100.00     10.58    46.24
>>>>>> scan              12.24     53.50     12.24    53.50
>>>>>> type.convert       0.04      0.17      0.04     0.17
>>>>>> make.names         0.02      0.09      0.02     0.09
>>>>>>
>>>>>> $sample.interval
>>>>>> [1] 0.02
>>>>>>
>>>>>> $sampling.time
>>>>>> [1] 22.88
>>>>>>
>>>>>>
>>>>>> Profile results with 250 line file:
>>>>>>
>>>>>> > summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>>> $by.self
>>>>>>              self.time self.pct total.time total.pct
>>>>>> scan             23.88    68.15      23.88     68.15
>>>>>> read.table       10.78    30.76      35.04    100.00
>>>>>> type.convert      0.30     0.86       0.32      0.91
>>>>>> character         0.02     0.06       0.02      0.06
>>>>>> file              0.02     0.06       0.02      0.06
>>>>>> lapply            0.02     0.06       0.02      0.06
>>>>>> unlist            0.02     0.06       0.02      0.06
>>>>>>
>>>>>> $by.total
>>>>>>                total.time total.pct self.time self.pct
>>>>>> read.table          35.04    100.00     10.78    30.76
>>>>>> scan                23.88     68.15     23.88    68.15
>>>>>> type.convert         0.32      0.91      0.30     0.86
>>>>>> sapply               0.04      0.11      0.00     0.00
>>>>>> character            0.02      0.06      0.02     0.06
>>>>>> file                 0.02      0.06      0.02     0.06
>>>>>> lapply               0.02      0.06      0.02     0.06
>>>>>> unlist               0.02      0.06      0.02     0.06
>>>>>> simplify2array       0.02      0.06      0.00     0.00
>>>>>>
>>>>>> $sample.interval
>>>>>> [1] 0.02
>>>>>>
>>>>>> $sampling.time
>>>>>> [1] 35.04
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <marklee...@gmail.com> wrote:
>>>>>>
>>>>>>> hi gene: maybe someone else will reply with some
>>>>>>> subtleties that I'm not aware of. one other thing that
>>>>>>> might help: if you know which columns you want, you can
>>>>>>> set the others to NULL through colClasses and this should
>>>>>>> speed things up also. For example, say you knew you only
>>>>>>> wanted the first four columns and they were character.
>>>>>>> then you could do,
>>>>>>>
>>>>>>> read.table(whatever, as.is=TRUE, colClasses =
>>>>>>> c(rep(character,4), rep(NULL,3696)).
>>>>>>>
>>>>>>> hopefully someone else will say something that does the
>>>>>>> trick. the difference in timings seems odd to me. good
>>>>>>> luck.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gley...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Mark,
>>>>>>>>
>>>>>>>> Thank you for the reply
>>>>>>>>
>>>>>>>> I neglected to mention that I had already set
>>>>>>>> options(stringsAsFactors=FALSE)
>>>>>>>>
>>>>>>>> I agree, skipping the factor determination can help
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> The main reason that I wanted to use read.table is
>>>>>>>> because it will correctly determine the column classes
>>>>>>>> for me. I don't really want to specify 3700 column
>>>>>>>> classes! (I'm not sure what they are anyway).
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <marklee...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Gene: Sometimes using colClasses in read.table
>>>>>>>>> can speed things up.
>>>>>>>>> If you know what your variables are ahead of time
>>>>>>>>> and what you want them to be, this allows you to be
>>>>>>>>> specific by specifying character or numeric, etc.,
>>>>>>>>> and often it makes things faster. others will
>>>>>>>>> have more to say.
>>>>>>>>>
>>>>>>>>> also, if most of your variables are characters, R
>>>>>>>>> will try to convert them into factors by default.
>>>>>>>>> If you use as.is = TRUE it won't do this and that
>>>>>>>>> might speed things up also.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Rejoinder: above tidbits are just from
>>>>>>>>> experience. I don't know if it's set in stone or a
>>>>>>>>> hard and fast rule.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gley...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> ** Disclaimer: I'm looking for general
>>>>>>>>>> suggestions **
>>>>>>>>>> I'm sorry, but can't send out the file I'm using,
>>>>>>>>>> so there is no reproducible example.
>>>>>>>>>>
>>>>>>>>>> I'm using read.table and it's taking over 30
>>>>>>>>>> seconds to read a tiny file. The strange thing is
>>>>>>>>>> that it takes roughly the same amount of time if
>>>>>>>>>> the file is 100 times larger.
>>>>>>>>>>
>>>>>>>>>> After re-reviewing the data Import / Export
>>>>>>>>>> manual I think the best approach would be to use
>>>>>>>>>> Python, or perhaps the readLines function, but
>>>>>>>>>> I was hoping to understand why the simple
>>>>>>>>>> read.table approach wasn't working as expected.
>>>>>>>>>>
>>>>>>>>>> Some relevant facts:
>>>>>>>>>>
>>>>>>>>>> 1. There are about 3700 columns. Maybe this is
>>>>>>>>>>    the problem? Still the file size is not very
>>>>>>>>>>    large.
>>>>>>>>>> 2. The file encoding is ANSI, but I'm not
>>>>>>>>>>    specifying that in the function. Setting
>>>>>>>>>>    fileEncoding="ANSI" produces an "unsupported
>>>>>>>>>>    conversion" error
>>>>>>>>>> 3. readLines imports the lines quickly
>>>>>>>>>> 4. scan imports the file quickly also
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Obviously, scan and readLines would require more
>>>>>>>>>> coding to identify columns, etc.
>>>>>>>>>>
>>>>>>>>>> my code:
>>>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t', header=TRUE))
>>>>>>>>>>
>>>>>>>>>> It's taking 33.4 seconds and the file size is
>>>>>>>>>> only 315 kb!
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Gene
>>>>>>>>>>
>>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>>
>>>>>>>>>> ______________________________________________
>>>>>>>>>> R-help@r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>> PLEASE do read the posting guide
>>>>>>>>>> http://www.R-project.org/posting-guide.html and
>>>>>>>>>> provide commented, minimal, self-contained,
>>>>>>>>>> reproducible code.
>>>>>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>
>>>>> --
>>>>> Peter Dalgaard, Professor,
>>>>> Center for Statistics, Copenhagen Business School
>>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>>> Phone: (+45)38153501
>>>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
>>>>>

- --
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      rai...@krugs.de

Skype:      RMkrug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7gffsACgkQoYgNqgF2egpNpACeLbyAXB1pLGgyt7hAE7QAWe9i
uV0An1Z8tvGw/1+40JM6YSe3aDqQoRkh
=/mB7
-----END PGP SIGNATURE-----
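
For anyone who wants to experiment with the points raised in this thread, here is a minimal sketch in R. It generates a small, very wide tab-separated file (the kind of stand-in file Peter asked for) and times read.table() twice: once guessing the column classes, and once with the quoted "character" / "NULL" form of colClasses that Peter pointed out. The dimensions (7 rows, 3700 columns), the column names, the `keep` count, and the use of a temporary file are all assumptions made for illustration, not the original data. It may well not reproduce the 30-second slowdown, since several posters read the real file in a second or two.

```r
# Sketch only: file layout, names and sizes below are hypothetical.
set.seed(1)
n_rows <- 7      # assumed, roughly matching the dim(dat) shown in the thread
n_cols <- 3700   # assumed, "about 3700 columns"
keep   <- 4      # hypothetical number of leading columns we care about

# Unique, syntactically valid names, so no name repair is needed.
col_names <- paste0("col", seq_len(n_cols))

# A character matrix of numeric-looking values such as "0.266".
values <- matrix(format(round(runif(n_rows * n_cols), 3), nsmall = 3),
                 nrow = n_rows, ncol = n_cols)

# Write a header line plus n_rows tab-separated data lines.
path <- tempfile(fileext = ".txt")
write.table(values, file = path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = col_names)

# 1. Let read.table() work out every column class itself.
t_guess <- system.time(
  dat_all <- read.table(path, sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE)
)

# 2. Declare the classes up front. The quotes matter: "character" and
#    "NULL" are strings. Unquoted, character is a function and
#    rep(NULL, n) is just NULL, so the intended vector is never built.
classes <- c(rep("character", keep), rep("NULL", n_cols - keep))
t_fixed <- system.time(
  dat_few <- read.table(path, sep = "\t", header = TRUE,
                        colClasses = classes)
)

# Elapsed seconds for each variant (values depend on the machine).
print(c(guess = t_guess[["elapsed"]], fixed = t_fixed[["elapsed"]]))
print(dim(dat_all))
print(dim(dat_few))

unlink(path)  # remove the temporary file
```

Columns marked "NULL" are skipped entirely, so `dat_few` keeps only the first `keep` columns. If the two timings are close on a given machine, that would point away from class guessing and towards something outside read.table itself, such as the virus-scanner question Peter raised.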