On Sun, 31 Dec 2006, Charles C. Berry wrote: > On Sun, 31 Dec 2006, Farrel Buchinsky wrote: > >> I have hundreds of humans who have undergone SNP genotyping at hundreds of >> loci. Some have even undergone the procedure twice or thrice (kind of an >> internal control). >> >> So obviously I need to find those replications, and confirm that the >> results >> are the same. If there is discordance then I need to address it. > > Why not use duplicated() ?
More specifically: unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] ) gives a list of those IDs for which the SNPs in all replicates of an ID are not the same. > > For a data.frame with 200 rows of which about 50 are duplicates and 201 > columns finding the (non) duplicates takes little time on my year old AMD 64 > running Windows XP: > >> my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T))) >> snp.dat <- lapply(1:200,function(x) 0:1 ) >> snp.frame <- as.data.frame(do.call(cbind,snp.dat)) >> my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,]) >> system.time( table(duplicated(my.dat)) ) > [1] 0.03 0.00 0.03 NA NA >> > > Finding the non-duplicated rows for which there is at least one replication: > >> system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in% >> names(which(table(my.dat$ID)>1)) ) )) > [1] 0.05 0.00 0.05 NA NA > > >> >> >> I tried to use the aggregate function >> >> nr.attempts >> >> <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) >> This was simply to figure out how many times the same piece of information >> had been obtained. I ran out of patience. It took beyond forever and >> tapply >> did not perform much better. The reshape package did not help - it implied >> one was out of luck if the data was not numeric. All of my data is >> character >> or factor. >> >> Instead I used RODBC >> >> sqlSave(channel,RawSeq) >> to push the table into a Microsoft Access database >> Then a sql query, courtesy of the Microsoft Access Query Wizard a la >> design >> mode. >> >> SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS >> MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count( >> RawSeq.rownames) AS CountOfrownames >> FROM RawSeq >> WHERE (((RawSeq.GENOTYPE_ID)<>"")) >> GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID >> ORDER BY Count(RawSeq.rownames) DESC; >> >> This way I could easily use the minimum and maximum values to see if they >> were discordant. >> Microsoft Access handled it with aplomb. I plan to use RODBC to bring the >> result of the SQL query back into R. >> >> This is the first time I have seen Microsoft Access outpace R. >> Is my observation correct or am I missing something. I would much rather >> perform all data manipulation and analyses in R. >> >> >> >> -- >> Farrel Buchinsky >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help@stat.math.ethz.ch mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > Charles C. Berry (858) 534-2098 > Dept of Family/Preventive Medicine > E mailto:[EMAIL PROTECTED] UC San Diego > http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717 > > > > Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:[EMAIL PROTECTED] UC San Diego http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717 ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.