Also, instead of splitting the data frame itself, I split the row indices and then use those to access the information in the original data frame.
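
A minimal sketch of what I mean, assuming a small made-up data frame (`df`, `value`, `label` and `group` are illustrative names, not Mark's data):

```r
# Small illustrative data frame (hypothetical data, not from the thread)
set.seed(1)
df <- data.frame(
  value = rnorm(12),
  label = sample(c("a", "b", "c"), 12, replace = TRUE),
  group = factor(rep(c("g1", "g2", "g3"), times = 4)),
  stringsAsFactors = FALSE
)

# Split the row indices by the grouping factor instead of the data frame.
# The result is a named list of integer vectors, one per factor level.
idx <- split(seq_len(nrow(df)), df$group)

# Use each index vector to reach into the original data frame on demand,
# here to compute a per-group mean of one column.
group_means <- sapply(idx, function(i) mean(df$value[i]))

# Rows for a single group, extracted only when they are actually needed
g2_rows <- df[idx[["g2"]], ]

# Rough size comparison; the exact numbers depend on the R version
object.size(idx)
object.size(split(df, df$group))
```

The list of indices holds only one integer vector per level, so nothing from the parent data frame (column attributes, factor levels) is duplicated per group, which, as Chuck suggested, seems to be where the memory goes.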
On Tue, Dec 8, 2009 at 9:54 PM, Mark Kimpel <mwkim...@gmail.com> wrote:

> Hadley, Just as you were apparently writing I had the same thought and did
> exactly what you suggested, converting all columns except the one that I
> want split to character. Executed almost instantaneously without problem.
> Thanks! Mark
>
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN 46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 399-1219 Skype No Voicemail please
>
>
> On Tue, Dec 8, 2009 at 10:48 PM, hadley wickham <h.wick...@gmail.com> wrote:
>
> > Hi Mark,
> >
> > Why are you using factors? I think for this case you might find
> > characters are faster and more space efficient.
> >
> > Alternatively, you can have a look at the plyr package which uses some
> > tricks to keep memory usage down.
> >
> > Hadley
> >
> > On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
> > > Charles, I suspect you are correct regarding copying of the attributes.
> > > First off, selectSubAct.df is my "real" data, which turns out to be of the
> > > same dim() as myDataFrame below, but each column is made up of strings, not
> > > simple letters, and there are many levels in each column, which I did not
> > > properly duplicate in my first example. I have amended that below, and with
> > > the split the new object size is now not 10X the size of the original, but
> > > 100X. My "real" data is even more complex than this, so I suspect that is
> > > where the problem lies. I need to search for a better solution to my problem
> > > than split, for which I will start a separate thread if I can't figure
> > > something out.
> > >
> > > Thanks for pointing me in the right direction,
> > >
> > > Mark
> > >
> > > myDataFrame <- data.frame(matrix(paste("The rain in Spain",
> > >     as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
> > > mySplitVar <- factor(paste("Rainy days and Mondays", as.character(1:1400),
> > >     sep = "."))
> > > myDataFrame <- cbind(myDataFrame, mySplitVar)
> > > object.size(myDataFrame)
> > > ## 12860880 bytes # ~ 13MB
> > > myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
> > > object.size(myDataFrame.split)
> > > ## 1,274,929,792 bytes ~ 1.2GB
> > > object.size(selectSubAct.df)
> > > ## 52,348,272 bytes # ~ 52MB
> > >
> > >
> > > On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
> > >
> > >> On Tue, 8 Dec 2009, Mark Kimpel wrote:
> > >>
> > >>> I'm having trouble using split on a very large data-set with ~1400 levels
> > >>> of the factor to be split. Unfortunately, I can't reproduce it with the
> > >>> simple self-contained example below. As you can see, splitting the
> > >>> artificial dataframe of size ~13MB results in a split dataframe of
> > >>> ~144MB, an increase in memory allocation of ~10 fold for the split
> > >>> object. If split scales linearly, then my actual 52MB dataframe should be
> > >>> easily handled by my 12GB of RAM, but it is not. Instead, when I try to
> > >>> split selectSubAct.df on one of its factors with 1473 levels, my memory
> > >>> is slowly gobbled up (plus 3 GB of swap) until I cancel the operation.
> > >>>
> > >>> Any ideas on what might be happening? Thanks, Mark
> > >>
> > >> Each element of myDataFrame.split contains a copy of the attributes of the
> > >> parent data.frame.
> > >>
> > >> And probably it does scale linearly. But the scaling factor depends on the
> > >> size of the attributes that get copied, I guess.
> > >>
> > >>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
> > >>> mySplitVar <- factor(as.character(1:1400))
> > >>> myDataFrame <- cbind(myDataFrame, mySplitVar)
> > >>> object.size(myDataFrame)
> > >>> ## 12860880 bytes # ~ 13MB
> > >>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
> > >>> object.size(myDataFrame.split)
> > >>> ## 144524992 bytes # ~ 144MB
> > >>
> > >> Note:
> > >>
> > >> only.attr <- lapply(myDataFrame.split, function(x) sapply(x, attributes))
> > >> (object.size(myDataFrame.split) - object.size(myDataFrame)) / object.size(only.attr)
> > >> 1.03726179240978 bytes
> > >>
> > >>> object.size(selectSubAct.df)
> > >>> ## 52,348,272 bytes # ~ 52MB
> > >>
> > >> What was this??
> > >>
> > >> Chuck
> > >>
> > >>> sessionInfo()
> > >>> R version 2.10.0 Patched (2009-10-27 r50222)
> > >>> x86_64-unknown-linux-gnu
> > >>>
> > >>> locale:
> > >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> > >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> > >>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> > >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> > >>> [9] LC_ADDRESS=C LC_TELEPHONE=C
> > >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> > >>>
> > >>> attached base packages:
> > >>> [1] stats graphics grDevices datasets utils methods base
> > >>>
> > >>> loaded via a namespace (and not attached):
> > >>> [1] tools_2.10.0
> > >>>
> > >>> [[alternative HTML version deleted]]
> > >>>
> > >>> ______________________________________________
> > >>> R-help@r-project.org mailing list
> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > >>> and provide commented, minimal, self-contained, reproducible code.
> > >>
> > >> Charles C. Berry (858) 534-2098
> > >> Dept of Family/Preventive Medicine
> > >> E mailto:cbe...@tajo.ucsd.edu UC San Diego
> > >> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
> >
> > --
> > http://had.co.nz/

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?