Farrel Buchinsky <[EMAIL PROTECTED]> wrote:
> Bottom Line Up Front: How does one reshape genetic data from long to wide?

I avoid both your "long" and "wide" formats because they are awkward and inefficient for large data sets. The "long" format wastes a huge amount of space with redundant column values, and manipulating a data frame with 12000 columns is not much fun either.

Instead, I pack genotypes into strings: usually one string per SNP. So in your example, I'd have a small 180-row table with a few columns of data about the samples, and a small 6000-row SNP table with one column of packed genotypes. The sample table is organized so the row order matches the positions of the sample genotypes in the packed genotype strings. If I want genotypes for a particular SNP, I unpack them with strsplit(), tack them onto the sample table as a new column, and discard when I'm done.

I store genotype data in our database this way as well. We also have a "long" format table, but I avoid it whenever possible because the packed format is so much more convenient. The time saved pulling data out of the database in this format dwarfs the time spent parsing out the genotype strings.

-- David Hinds
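
The packed layout described above could look roughly like the following in R. This is only a sketch under stated assumptions: the table names, column names, SNP IDs, and the one-character-per-genotype coding are all made up for illustration, since the post does not say how genotypes are encoded or delimited inside the strings.

```r
# Hypothetical sample table. Its row order defines the position of each
# sample's genotype inside every packed string.
samples <- data.frame(
  id     = c("S1", "S2", "S3", "S4"),
  status = c("case", "control", "case", "control"),
  stringsAsFactors = FALSE
)

# Hypothetical SNP table: one packed genotype string per SNP.
# Assumed coding (not from the post): one character per sample,
# "0"/"1"/"2" = copies of the minor allele, "N" = missing call.
snps <- data.frame(
  snp    = c("rs0001", "rs0002"),
  packed = c("012N", "2100"),
  stringsAsFactors = FALSE
)

# Unpack one SNP into a character vector aligned with the rows of `samples`.
unpack_genotypes <- function(snp_table, snp_id) {
  packed <- snp_table$packed[snp_table$snp == snp_id]
  stopifnot(length(packed) == 1)   # exactly one row per SNP ID
  strsplit(packed, "")[[1]]        # split into single characters
}

geno <- unpack_genotypes(snps, "rs0001")
stopifnot(length(geno) == nrow(samples))  # string length must match sample count

# Tack the genotypes on as a temporary column, use them, then discard.
samples$geno <- geno
counts <- table(samples$status, samples$geno)
samples$geno <- NULL
```

With a delimited encoding (for example "AA,AG,GG"), the same idea would apply with the delimiter passed as the split argument to strsplit() instead of the empty string.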