Hi Suganthi,

You're right -- these were not corrected for strand, and the schema description is incorrect. I will revisit the HGDP data files and see if there's a way to identify and fix these cases.
An expedient possibility, if you are adept at Perl or some other programming language: read in the genome FASTA sequence, discard newlines etc., and store each chromosome's sequence in one large string (possibly one chrom at a time, since the input is sorted; read the chrom sequence each time a new chrom appears). For each SNP, look up the reference base at the given coordinate (substr). If neither allele matches the reference base, but the reference does match a complemented allele (I guess it had better), replace the given alleles with their complements. This would leave some ambiguity of ancestral vs. derived if the dataset included C/G or A/T SNPs, since complementing such a pair yields the same two bases, but it doesn't (by design of the Illumina assay, pers. comm. Devin Absher).

Sorry for the inconvenience,
Angie

----- "Suganthi Bala" <[email protected]> wrote:

> From: "Suganthi Bala" <[email protected]>
> To: [email protected]
> Sent: Tuesday, September 14, 2010 7:50:28 PM GMT -08:00 US/Canada Pacific
> Subject: [Genome] HGDP SNP data
>
> Hi,
>
> This pertains to the data that I downloaded for HGDP SNPs via the Table
> Browser for HG18 build. It appears that the SNPs are not always reported
> with respect to the forward strand of the reference genome even though that
> is what the table schema indicates. For eg, the following SNPs: rs2296441,
> rs12782963, rs4758443 etc.
>
> Is it possible that it was mistakenly not corrected for strand orientation?
> If yes, is it possible to get a fixed file quickly? Thanks.
>
> Best,
> Suganthi Bala
> Yale University
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
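For reference, the strand-fix procedure described in the reply above could be sketched in Python roughly as follows. This is only a sketch under assumptions that are not stated in the thread: the function names (`read_fasta`, `ref_base_at`, `fix_strand`) are made up for illustration, the SNP coordinate is taken to be a 0-based start, and all sequences are held in memory at once rather than one chromosome at a time.

```python
# Sketch only: names, 0-based coordinates and the in-memory layout are assumptions.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Parse FASTA text lines into {name: sequence}, joining wrapped lines."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Use the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def ref_base_at(seqs, chrom, start):
    """Return the reference base at a 0-based position (assumed convention)."""
    return seqs[chrom][start]


def fix_strand(ref_base, alleles):
    """Return (alleles, status) where status is 'ok', 'flipped' or 'unresolved'.

    If no given allele matches the reference base but a complemented allele
    does, both alleles are replaced by their complements.
    """
    ref = ref_base.upper()
    given = [a.upper() for a in alleles]
    if ref in given:
        return given, "ok"
    flipped = [a.translate(COMPLEMENT) for a in given]
    if ref in flipped:
        return flipped, "flipped"
    # Neither orientation matches the reference: leave as-is and flag it.
    return given, "unresolved"


if __name__ == "__main__":
    genome = read_fasta([">chrDemo", "ACGT", "TTGA"])  # toy sequence, not hg18
    base = ref_base_at(genome, "chrDemo", 3)
    print(base, fix_strand(base, ["A", "G"]))
```

For a real genome, the dictionary could be replaced by loading one chromosome at a time as the sorted SNP input moves to a new chrom, as suggested in the reply. SNPs flagged `unresolved` would need to be checked by hand.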
