Here's the essence of a solution (0.14 sec for this bit)

res <- with(betterTest, {
    subjects <- levels(subject)
    loci <- levels(locus)
    ## replace "" by as.character(NA) if you prefer
    res <- matrix("", length(subjects), length(loci),
                  dimnames = list(subjects, loci))
    ind <- cbind(as.integer(subject), as.integer(locus))
    res[ind] <- as.character(genotype)
    res
})

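To make the indexing step concrete, here is a minimal sketch on a toy data set. The object names (toy, wide) and the genotype values are made up for illustration and are not part of the original example. The point is that a two-column integer matrix used as an index addresses one (row, column) cell per row, so the whole long-to-wide fill is a single vectorized assignment.

```r
## Toy long-format data; 'toy' and its values are illustrative only.
toy <- data.frame(
    subject  = factor(c("s1", "s1", "s2", "s3")),
    locus    = factor(c("a",  "b",  "a",  "b")),
    genotype = c("AA", "AG", "GG", "AG"),
    stringsAsFactors = FALSE
)

wide <- with(toy, {
    ## start from an all-NA character matrix, one row per subject level
    ## and one column per locus level
    res <- matrix(NA_character_, nlevels(subject), nlevels(locus),
                  dimnames = list(levels(subject), levels(locus)))
    ## each row of 'ind' is a (row, column) pair built from the factor codes
    ind <- cbind(as.integer(subject), as.integer(locus))
    ## matrix indexing: fill every addressed cell in one assignment
    res[ind] <- as.character(genotype)
    res
})

print(wide)
```

Subject/locus combinations that never occur (s2/b and s3/a here) simply keep the fill value. If a combination occurs more than once, the last matching row is the one that ends up in the cell.
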
The solution above produces a character matrix, mainly because that is
what I had before.  I think a matrix is probably a better data structure
than a wide data frame, but it can easily be converted (which will take a
little longer: 1.5s if you use as.data.frame, but there are much faster
ways).

It is certainly possible to do the same thing with a factor matrix result,
but there seems to be a problem with the subsetting method for such
objects.

In general I would suggest avoiding data frames when all the columns are
of one type and you are concerned with efficiency.


On Thu, 24 Aug 2006, Mitch Skinner wrote:

> I'd like to thank everyone that's replied so far--more inline:
>
> On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> > Your example does not correspond to your description.  You have taken a
> > random number of loci for each subject and measured each a random number
> > of times:
>
> You're right.  I was trying to come up with an example that didn't
> require sending out a big hunk of data.  The overall number of
> rows/columns and the data types/sizes in the example were true to life
> but the relationship between columns was not.  Also, in my testing the
> run time of the random example was pretty close to (actually faster
> than) the run time on my real data.
>
> In the real data, there's about one row per subject/locus pair (some
> combinations are missing).  The genotype data does have character type;
> I'd have to think a bit to see if I could make it into an integer
> vector.  Aside from just making it a factor, of course.
>
> Thanks to Gabor Grothendieck for demonstrating gl():
>
> > betterTest=data.frame(subject=as.character(1:70),
>     locus=as.character(gl(4500, 70)),
>     genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> > sapply(betterTest, is.factor)
>  subject    locus genotype
>     TRUE     TRUE     TRUE
> > system.time(wideTest <- reshape(betterTest, v.names="genotype",
>     timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1356.209  178.867 2071.640    0.000    0.000
> > dim(wideTest)
> [1]   70 4501
> > dim(betterTest)
> [1] 315000      3
>
> This was on a different machine (a 2.2 Ghz Athlon 64).  The only
> difference I can think of between betterTest and my actual data is that
> betterTest is ordered.
>
> > Also, subject and locus are archetypal factors, and forcing them to be
> > character vectors is just making efficiency problems for yourself.
>
> Hmmmm, that's the way they're coming out of the database.  I'm using
> RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
> database interface wasn't turning things into factors.  Given my (low)
> level of R knowledge, I'd have to think for a while to convince myself
> that doing so wouldn't make a difference aside from being faster.  Of
> course, if you're asserting that that's the case I'll take your word for
> it.
>
> > I have an R-level solution that takes 0.2 s on my machine, and involves no
> > changes to R.
> >
> > However, you did not give your affiliation and I do not like giving free
> > consultancy to undisclosed commercial organizations.  Please in future use
> > a proper signature block so that helpers are aware of your provenance.
>
> Ah, I hadn't really thought about this, but I see where you're coming
> from.  I work here (my name and this email address are on the page):
> http://egcrc.org/pis/white-c.htm
> Please forgive my r-devel-newbieness; this is less of an issue on the
> other mailing lists I follow.
>
> When there's a chance (however slim, in this case) that something I
> write will end up getting used by someone else, I usually use my
> personal email address and general identity, because I know it'll follow
> me if I change jobs.  The concern, of course, being that someone using
> it will want to get in touch with me sometime in the far future.  I
> don't exactly have a tenured position.
>
> I really am trying to give at least as much as I'm taking; hopefully my
> first email shows that I did a healthy bit of
> thinking/reading/googling/coding before posting (maybe too much).
> Apparently the c-solution isn't necessary, but doing this in 0.2s is
> pretty amazing.  On the same size data frame?
>
> Thanks,
> Mitch Skinner                    Tel: 510-985-3192
> Programmer/Analyst
> Ernest Gallo Clinic & Research Center
> University of California, San Francisco
>

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel