Juliet, I've been corrected off list. I did not read properly that you are on 64bit.
The calculation should be :

    53860858 * 4 * 8 /1024^3 = 1.6GB

since pointers are 8 bytes on 64bit.

Also, data.table is an add-on package so I should have included :

    install.packages("data.table")
    require(data.table)

data.table is available on all platforms, both 32bit and 64bit.

Please forgive mistakes: 'someoone' should be 'someone', 'percieved' should
be 'perceived' and 'testDate' should be 'testData' at the end.

The rest still applies, and you might have a much easier time than I thought
since you are on 64bit. I was working on the basis of squeezing into 32bit.

Matthew


"Matthew Dowle" <mdo...@mdowle.plus.com> wrote in message
news:i1faj2$lv...@dough.gmane.org...
>
> Hi Juliet,
>
> Thanks for the info.
>
> It is very slow because of the == in testData[testData$V2==one_ind,]
>
> Why? Imagine someoone looks for 10 people in the phone directory. Would
> they search the entire phone directory for the first person's phone
> number, starting on page 1, looking at every single name, even continuing
> to the end of the book after they had found them ? Then would they start
> again from page 1 for the 2nd person, and then the 3rd, searching the
> entire phone directory from start to finish for each and every person ?
> That code using == does that. Some of us call that a 'vector scan' and it
> is a common reason for R being percieved as slow.
>
> To do that more efficiently try this :
>
> testData = as.data.table(testData)
> setkey(testData,V2) # sorts data by V2
> for (one_ind in mysamples) {
>   one_sample <- testData[one_ind,]
>   reshape(one_sample)
> }
>
> or just this :
>
> testData = as.data.table(testData)
> setkey(testDate,V2)
> testData[,reshape(.SD,...), by=V2]
>
> That should solve the vector scanning problem, and get you on to the
> memory problems which will need to be tackled.
> Since the 4 columns are character, then the object size should be
> roughly :
>
> 53860858 * 4 * 4 /1024^3 = 0.8GB
>
> That is more promising to work with in 32bit so there is hope. [ That
> 0.8GB ignores the (likely small) size of the unique strings in the global
> string hash (depending on your data). ]
>
> It's likely that the as.data.table() fails with out of memory. That is
> not data.table but unique. There is a change in unique.c in R 2.12 which
> makes unique more efficient and since factor calls unique, it may be
> necessary to use R 2.12.
>
> If that still doesn't work, then there are several more tricks (and we
> will need further information), and there may be some tweaks needed to
> that code as I didn't test it, but I think it should be possible in 32bit
> using R 2.12.
>
> Is it an option to just keep it in long format and use a data.table ?
>
> testDate[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]
>
> Why do you need to reshape from long to wide ?
>
> HTH,
> Matthew
>
>
>
> "Juliet Hannah" <juliet.han...@gmail.com> wrote in message
> news:aanlktinyvgmrvdp0svc-fylgogn2ro0omnugqbxx_...@mail.gmail.com...
> Hi Jim,
>
> Thanks for responding. Here is the info I should have included before.
> I should be able to access 4 GB.
>
> > str(myData)
> 'data.frame': 53860857 obs. of 4 variables:
> $ V1: chr "200003" "200006" "200047" "200050" ...
> $ V2: chr "cv0001" "cv0001" "cv0001" "cv0001" ...
> $ V3: chr "A" "A" "A" "B" ...
> $ V4: chr "B" "B" "A" "B" ...
> > sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> On Mon, Jul 12, 2010 at 7:54 AM, jim holtman <jholt...@gmail.com> wrote:
>> What is the configuration you are running on (OS, memory, etc.)? What
>> does your object consist of? Is it numeric, factors, etc.? Provide a
>> 'str' of it. If it is numeric, then the size of the object is
>> probably about 1.8GB. Doing the long to wide you will probably need
>> at least that much additional memory to hold the copy, if not more.
>> This would be impossible on a 32-bit version of R.
>>
>> On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah <juliet.han...@gmail.com>
>> wrote:
>>> I have a data set that has 4 columns and 53860858 rows. I was able to
>>> read this into R with:
>>>
>>> cc <- rep("character",4)
>>> myData <-
>>> read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrow=53860858,sep=",")
>>>
>>>
>>> I need to reshape this data from long to wide. On a small data set the
>>> following lines work. But on the real data set, it didn't finish even
>>> when I took a sample of two (rows in new data). I didn't receive an
>>> error. I just stopped it because it was taking too long. Any
>>> suggestions for improvements? Thanks.
>>>
>>> # start example
>>> # i have commented out the write.table statement below
>>>
>>> testData <- read.table(textConnection("rs9999853,cv0084,A,A
>>> rs999986,cv0084,C,B
>>> rs9999883,cv0084,E,F
>>> rs9999853,cv0085,G,H
>>> rs999986,cv0085,I,J
>>> rs9999883,cv0085,K,L"),header=FALSE,sep=",")
>>> closeAllConnections()
>>>
>>> mysamples <- unique(testData$V2)
>>>
>>> for (one_ind in mysamples) {
>>>   one_sample <- testData[testData$V2==one_ind,]
>>>   mywide <- reshape(one_sample, timevar = "V1", idvar =
>>> "V2",direction = "wide")
>>>   # write.table(mywide,file
>>> ="newdata.txt",append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
>>> }
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
>
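
For anyone who wants to try the two points from this thread on the toy data, here is a minimal sketch. It is not tested on the real 53860858-row file. The names n_rows, n_cols, estimate_gb, toy and toy_dt are made up for illustration, and the .() join syntax is an assumption about current data.table releases rather than the 2010 version discussed above. The data.table part is skipped if the package is not installed.

```r
# Part 1: rough size of the pointer vectors for the 4 character columns.
# Assumption: one pointer per cell (4 bytes on 32bit, 8 bytes on 64bit);
# the unique strings themselves are ignored, as in the thread.
n_rows <- 53860858
n_cols <- 4
estimate_gb <- function(bytes_per_pointer) {
  n_rows * n_cols * bytes_per_pointer / 1024^3
}
size_32bit <- estimate_gb(4)
size_64bit <- estimate_gb(8)
print(round(c(size_32bit, size_64bit), 1))  # 0.8 and 1.6

# Part 2: vector scan versus keyed lookup on the toy data from the thread.
toy <- data.frame(
  V1 = rep(c("rs9999853", "rs999986", "rs9999883"), times = 2),
  V2 = rep(c("cv0084", "cv0085"), each = 3),
  V3 = c("A", "C", "E", "G", "I", "K"),
  V4 = c("A", "B", "F", "H", "J", "L"),
  stringsAsFactors = FALSE
)

# Vector scan: every element of V2 is compared against the value.
scan_rows <- toy[toy$V2 == "cv0085", ]

same_rows <- NA  # stays NA when data.table is not available
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  toy_dt <- as.data.table(toy)
  setkey(toy_dt, V2)                 # sorts by V2 and marks it as the key
  keyed_rows <- toy_dt[.("cv0085")]  # join on the key instead of scanning
  # Both approaches should select the same three rows.
  same_rows <- nrow(keyed_rows) == nrow(scan_rows) &&
    setequal(keyed_rows$V1, scan_rows$V1)
}
```

On six rows the two lookups are equally fast; the difference only matters when the == comparison is repeated once per sample over tens of millions of rows.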