I think I might be able to answer my own question here. It turns out in the step
tab <- table(dats3$student_unique_id, dats3$teacher_unique_id) The dimensions of this table in my data are significantly larger than the simulated data, thus consuming more memory. So, I know that the lme4 package has a method for sparse crosstabs, so I tried this: > library(lme4) > tab <- xtabs(~ dats3$student_unique_id + dats3$teacher_unique_id, sparse = TRUE) > result <- data.frame(Student = rownames(tab), Freq = rowSums(tab), tch = rowSums(tab > 0) == 1) And the world works beautifully. > -----Original Message----- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Doran, Harold > Sent: Monday, March 09, 2009 10:25 AM > To: ONKELINX, Thierry; jholt...@gmail.com > Cc: r-help@r-project.org > Subject: Re: [R] Making tapply code more efficient > > Thierry and Jim: > > Thank you both for your reply. I remain a bit baffled over something. > Here is the sample data generated by jim and code by Thierry, > which works exactly as expected. > > x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, > TRUE)) x <- data.frame(x) names(x)[1:2] <- > c('student_unique_id', 'teacher_unique_id') tab <- > table(x$student_unique_id, x$teacher_unique_id) result <- > data.frame(Student = rownames(tab), Freq = rowSums(tab), tch > = rowSums(tab > 0) == 1) > > Now, here is what happens when I run this on my data (called dats3) > > > tab <- table(dats3$student_unique_id, dats3$teacher_unique_id) > Error: cannot allocate vector of size 942.8 Mb > > So, let's take a look at a couple of things: > > > object.size(dats3) < object.size(x) > [1] TRUE > > > str(x) > 'data.frame': 800967 obs. of 2 variables: > $ student_unique_id: int 121914 89142 127790 61350 54684 > 28018 313428 > 27595 316285 173571 ... > $ teacher_unique_id: int 17 1 19 20 3 18 15 1 14 15 ... > > str(dats3) > 'data.frame': 56204 obs. of 2 variables: > $ student_unique_id: int 20504 26172 20504 3609 4313 5058 5363 5669 > 6429 6560 ... > $ teacher_unique_id: int 35078 41029 35078 41437 41476 41456 41486 > 35415 41508 35413 ... > > The sample data are smaller in size than my actual data and > the structure is exactly the same. Do you see any other > reason why the memory issue would arise here? > > Harold > > > > > > > -----Original Message----- > > From: ONKELINX, Thierry [mailto:thierry.onkel...@inbo.be] > > Sent: Friday, February 27, 2009 10:24 AM > > To: Doran, Harold; r-help@r-project.org > > Subject: RE: [R] Making tapply code more efficient > > > > Hi Harold, > > > > What about this? You one have to make the crosstabulation once. > > > > > qq <- data.frame(student = factor(c(1,1,2,2,2)), teacher = > > factor(c(10,10,20,20,25))) > > > tab <- table(qq$student, qq$teacher) data.frame(Student = > > > rownames(tab), Freq = rowSums(tab), tch = > > rowSums(tab > 0) == 1) > > Student Freq tch > > 1 1 2 TRUE > > 2 2 3 FALSE > > > > HTH, > > > > Thierry > > > > > > -------------------------------------------------------------- > > ---------- > > ---- > > ir. Thierry Onkelinx > > Instituut voor natuur- en bosonderzoek / Research Institute > for Nature > > and Forest Cel biometrie, methodologie en kwaliteitszorg / Section > > biometrics, methodology and quality assurance Gaverstraat 4 9500 > > Geraardsbergen Belgium tel. + 32 > > 54/436 185 thierry.onkel...@inbo.be www.inbo.be > > > > To call in the statistician after the experiment is done may be no > > more than asking him to perform a post-mortem > > examination: he may be able to say what the experiment died of. > > ~ Sir Ronald Aylmer Fisher > > > > The plural of anecdote is not data. > > ~ Roger Brinner > > > > The combination of some data and an aching desire for an > answer does > > not ensure that a reasonable answer can be extracted from a > given body > > of data. > > ~ John Tukey > > > > -----Oorspronkelijk bericht----- > > Van: r-help-boun...@r-project.org > > [mailto:r-help-boun...@r-project.org] > > Namens Doran, Harold > > Verzonden: vrijdag 27 februari 2009 15:47 > > Aan: r-help@r-project.org > > Onderwerp: [R] Making tapply code more efficient > > > > Previously, I posed the question pasted down below to the list and > > received some very helpful responses. While the code suggestions > > provided in response indeed work, they seem to only work > with *very* > > small data sets and so I wanted to follow up and see if anyone had > > ideas for better efficiency. > > I was quite embarrased on this as our SAS programmers cranked out > > programs that did this in the blink of an eye (with a few > variables), > > but R was spinning for days on my Ubuntu machine and > ultimately I saw > > a message that R was "killed". > > > > The data I am working with has 800967 total rows and 31 > total columns. > > The ID variable I use as the index variable in tapply() has > > 326397 unique cases. > > > > > length(unique(qq$student_unique_id)) > > [1] 326397 > > > > To give a sense of what my data look like and the actual problem, > > consider the following: > > > > qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)), > > teacher_unique_id = factor(c(10,10,20,20,25))) > > > > This is a student achievement database where students > occupy multiple > > rows in the data and the variable teacher_unique_id denotes > the class > > the student was in. What I am doing is looking to see if > the teacher > > is the same for each instance of the unique student ID. So, if I > > implement the following: > > > > same <- function(x) length( unique(x) ) == 1 results <- data.frame( > > freq = tapply(qq$student_unique_id, qq$student_unique_id, > > length), > > tch = tapply(qq$teacher_unique_id, > qq$student_unique_id, same) > > ) > > > > I get the following results. I can see that student 1 > appears in the > > data twice and the teacher is always the same. > > However, student 2 appears three times and the teacher is > not always > > the same. > > > > > results > > freq tch > > 1 2 TRUE > > 2 3 FALSE > > > > Now, implementing this same procedure to a large data set with the > > characteristics described above seems to be problematic in this > > implementation. > > > > Does anyone have reactions on how this could be more efficient such > > that it can run with large data as I described? > > > > Harold > > > > > sessionInfo() > > R version 2.8.1 (2008-12-22) > > x86_64-pc-linux-gnu > > > > locale: > > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLA > > TE=en_US.U > > TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF- > > 8;LC_NAME= > > C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_ID > > ENTIFICATI > > ON=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > > > > > > > ##### Original question posted on 1/13/09 Suppose I have a > dataframe > > as follows: > > > > dat <- data.frame(id = c(1,1,2,2,2), var1 = > c(10,10,20,20,25), var2 = > > c('foo', 'foo', 'foo', 'foobar', 'foo')) > > > > Now, if I were to subset by id, such as: > > > > > subset(dat, id==1) > > id var1 var2 > > 1 1 10 foo > > 2 1 10 foo > > > > I can see that the elements in var1 are exactly the same and the > > elements in var2 are exactly the same. However, > > > > > subset(dat, id==2) > > id var1 var2 > > 3 2 20 foo > > 4 2 20 foobar > > 5 2 25 foo > > > > Shows the elements are not the same for either variable in this > > instance. So, what I am looking to create is a data frame > that would > > be like this > > > > id freq var1 var2 > > 1 2 TRUE TRUE > > 2 3 FALSE FALSE > > > > Where freq is the number of times the ID is repeated in the > dataframe. > > A TRUE appears in the cell if all elements in the column > are the same > > for the ID and FALSE otherwise. It is insignificant which values > > differ for my problem. > > > > The way I am thinking about tackling this is to loop through the ID > > variable and compare the values in the various columns of the > > dataframe. > > The problem I am encountering is that I don't think all.equal or > > identical are the right functions in this case. > > > > So, say I was wanting to compare the elements of var1 for id ==1. I > > would have > > > > x <- c(10,10) > > > > Of course, the following works > > > > > all.equal(x[1], x[2]) > > [1] TRUE > > > > As would a similar call to identical. However, what if I > only have a > > vector of values (or if the column consists of names) that > I want to > > assess for equality when I am trying to automate a process over > > thousands of cases? As in the example above, the vector may contain > > only two values or it may contain many more. The number of > values in > > the vector differ by id. > > > > Any thoughts? > > > > Harold > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > Dit bericht en eventuele bijlagen geven enkel de visie van de > > schrijver weer en binden het INBO onder geen enkel beding, > zolang dit > > bericht niet bevestigd is door een geldig ondertekend document. The > > views expressed in this message and any annex are purely > those of the > > writer and may not be regarded as stating an official position of > > INBO, as long as the message is not confirmed by a duly signed > > document. > > > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.