Resending to correct bad subject line...

On Mon, 20 Jun 2011, Jim Silverton wrote:

I am using plink on a large SNP dataset with a .map and .ped file. I want to get some sort of file, say a list of all the SNPs that plink says I have. Any ideas on how to do this?


All the SNPs you have are listed in the .map file. An easy way to get the data into R, if there isn't too much, is to do this:

plink --file whatever --out whatever --recodeA

That will make a file called whatever.raw, single space delimited, consisting of minor allele counts (0, 1, 2, NA) that you can bring into R like this:

data <- read.table("whatever.raw", sep=" ", header=TRUE)
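
For the SNP list itself, the second column of the .map file can be pulled out directly. This is a minimal sketch, assuming the standard four-column .map layout (chromosome, SNP ID, genetic distance, base-pair position). The SNP IDs and the output name snplist.txt are made up for illustration, and the first command only creates a tiny stand-in for a real whatever.map:

```shell
# Demo input: a tiny stand-in for whatever.map (hypothetical SNP IDs).
# Columns: chromosome, SNP ID, genetic distance, base-pair position.
printf '1\trs0001\t0\t1000\n1\trs0002\t0\t2000\n2\trs0003\t0\t1500\n' > whatever.map

# The SNP IDs are the second whitespace-delimited column.
awk '{ print $2 }' whatever.map > snplist.txt

# Number of SNPs listed in the .map file.
wc -l < snplist.txt
```

With a real dataset, skip the printf line and run the awk command on the existing .map file.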

If you have tons of data, you'll want to work with the compact binary format (four genotypes per byte):

plink --file whatever --out whatever --make-bed

Then see David Duffy's reply. However, I'm not sure if R can work with the compact format in memory. It might expand those genotypes (minor allele counts) from two-bit integers to double-precision floats. What does read.plink() create in memory?

There is another package I've been meaning to look at that is supposed to help with the memory management problem for large genotype files:

http://cran.r-project.org/web/packages/ff/

I haven't used it yet, but I am hopeful. Maybe David Duffy or someone else here will know more about it.

If you have a lot of data, also consider chopping the data into pieces before loading it into R. That's what we do. With a 100-core system, I break the data into 100 files (I use the GNU/Linux "split" command and a few other tricks) and have all 100 cores run at once to analyze the data.
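
One possible way to do the chopping, offered as a sketch and not necessarily the procedure described above: split a list of SNP IDs into chunks and let PLINK's --extract option write each chunk to its own .raw file. The chunk count of 4, the file names, and the demo SNP IDs are all assumptions for illustration. The plink commands are only echoed, so the sketch runs without PLINK or real data:

```shell
# Demo SNP list (hypothetical IDs); in practice this could be column 2 of the .map file.
seq 1 10 | sed 's/^/rs/' > snplist.txt

n_chunks=4
total=$(wc -l < snplist.txt)
# Lines per chunk, rounded up so that no more than n_chunks files are produced.
per_chunk=$(( (total + n_chunks - 1) / n_chunks ))

# Writes chunk_aa, chunk_ab, ... each holding up to per_chunk SNP IDs.
split -l "$per_chunk" snplist.txt chunk_

# One PLINK job per chunk; drop the "echo" to actually run them.
for f in chunk_??; do
  echo plink --file whatever --extract "$f" --recodeA --out "whatever_$f"
done
```

Each resulting .raw file then holds all individuals but only a subset of SNPs, so the pieces can be analyzed independently.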

When I work with genotype data as allele counts using Octave, I store the data, both in files and in memory, as unsigned 8-bit integers, using 3 as the missing value. That's still inefficient compared to the PLINK system, but it is way better than using doubles.
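
The same missing-value convention can be applied to a PLINK .raw file before loading it elsewhere. This is a minimal sketch, assuming the usual .raw layout of six leading columns (FID IID PAT MAT SEX PHENOTYPE) followed by one allele-count column per SNP. The demo rows and the output name genotypes.txt are made up:

```shell
# Demo input: a tiny stand-in for whatever.raw (hypothetical individuals and SNPs).
cat > whatever.raw <<'EOF'
FID IID PAT MAT SEX PHENOTYPE rs0001_A rs0002_G
f1 i1 0 0 1 -9 0 NA
f2 i2 0 0 2 -9 2 1
EOF

# Skip the header and the six leading columns, and recode NA as 3,
# so every remaining value is 0, 1, 2, or 3 and fits in one unsigned byte.
awk 'NR > 1 {
  out = ""
  for (i = 7; i <= NF; i++) {
    g = ($i == "NA") ? 3 : $i
    out = out (i > 7 ? " " : "") g
  }
  print out
}' whatever.raw > genotypes.txt

cat genotypes.txt
```

The output is a plain matrix of small integers that can be read into an unsigned 8-bit array in whatever environment does the analysis.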

Best,
Mike

--
Michael B. Miller, Ph.D.
Minnesota Center for Twin and Family Research
Department of Psychology
University of Minnesota
