Re: [R] Merging data frames on two conditions

David Winsemius Tue, 06 Apr 2010 13:44:02 -0700


On Apr 6, 2010, at 4:03 PM, Abhishek Pratap wrote:

Hi David

Here it is. You can ignore the bio jargon if it sounds confusing.


Sometimes it is essential to have domain details.

The data types of the columns (SNP, chr) on which I am applying merge are the same.
merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP", "chr"))


str(data_lane6_snps)
'data.frame':   7724462 obs. of  10 variables:
 $ chr           : Factor w/ 25 levels "chr1","chr10",..: 1 1 1 1 11 1 1 1 1 ...
 $ SNP           : int  100 101 103 108 179 180 191 197 218 222 ...
 $ reference     : Factor w/ 5 levels "A","C","G","N",..: 2 2 5 2 25 2 2 1 5 ...
 $ genotype      : Factor w/ 10 levels "A","C","G","K",..: 1 1 1 8 22 3 8 2 2 ...
 $ consensus_qual: int  0 0 0 4 33 33 19 19 19 19 ...
 $ snp_qual      : int  0 0 0 4 0 33 19 19 19 19 ...
 $ rms_qual      : int  0 0 0 0 21 21 21 21 21 21 ...
 $ depth         : int  1 1 1 1 2 2 2 2 2 2 ...
 $ bases         : Factor w/ 453774 levels "^!,","^!,^!,",..: 5 5 5410998 49793 155731 284998 416878 133393 133393 ...
 $ base_quality  : Factor w/ 555104 levels "`","``","```",..: 359359 359 54813 92856 92856 92856 92856 92539 55424 ...
> str(data_lane6_snps_rsid)
'data.frame':   797807 obs. of  4 variables:
 $ chr : Factor w/ 24 levels "1","10","11",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP : int  68143872 11071026 69423434 12394791 1302846 953306933921381 57122299 41899656 76990037 ...

Looking at this line and the line for "SNP" in the data frame above, I am not seeing much similarity in range. There are also 10 times fewer observations. What was your plan for the non-matching cases? Did you really mean that you wanted a right outer join?
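
For reference, a minimal sketch of how an inner join and a right outer join differ in merge(). The frames `calls` and `rsids` and all their values are made up for illustration; they only mimic the column names used in this thread.

```r
# Toy data: hypothetical frames, not the poster's actual data.
calls <- data.frame(chr = c("chr1", "chr1", "chr2"),
                    SNP = c(100L, 101L, 500L),
                    depth = c(1L, 2L, 2L))
rsids <- data.frame(chr = c("chr1", "chr2", "chr3"),
                    SNP = c(100L, 500L, 900L),
                    rsid = c("rs10", "rs20", "rs30"))

# Inner join (the default): only (chr, SNP) pairs present in both frames survive.
inner <- merge(calls, rsids, by = c("chr", "SNP"))

# Right outer join: keep every row of the second frame and fill the
# columns coming from the first frame with NA where nothing matches.
right <- merge(calls, rsids, by = c("chr", "SNP"), all.y = TRUE)

nrow(inner)
nrow(right)
```

With this toy input the inner join keeps only the two shared pairs, while the right outer join also keeps the unmatched chr3 row with depth set to NA.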


You might get information by trying:

length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))

That would tell you how many potential matches you might have on the basis of SNP numbers, although an SNP match might or might not be a full match given the chr matching that is also being specified.
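
A small sketch of that distinction, using made-up frames `snps` and `rsid` (hypothetical names and values). It counts shared SNP numbers first and then shared (chr, SNP) pairs by pasting the two columns into one composite key; the ":" separator is an arbitrary choice.

```r
# Toy data: hypothetical values chosen so the two counts differ.
snps <- data.frame(chr = c("chr1", "chr1", "chr2"),
                   SNP = c(100L, 101L, 100L))
rsid <- data.frame(chr = c("chr1", "chr3"),
                   SNP = c(100L, 101L))

# Distinct SNP numbers present in both frames, ignoring chr.
# If this is zero, no joint (chr, SNP) match is possible.
snp_only <- length(intersect(snps$SNP, rsid$SNP))

# Distinct (chr, SNP) pairs present in both frames.
key_a <- paste(snps$chr, snps$SNP, sep = ":")
key_b <- paste(rsid$chr, rsid$SNP, sep = ":")
pair_matches <- length(intersect(key_a, key_b))

snp_only
pair_matches
```

Here SNP 101 appears in both frames but on different chromosomes, so it counts as an SNP match without being a full match.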

 $ end : int  68143872 11071026 69423434 12394791 1302846 953306933921381 57122299 41899656 76990037 ...
 $ rsid: Factor w/ 797807 levels "rs10","rs10000010",..: 100229685690 505395 470219 780326 29342 29263 327909 434159 723152 ...

On Tue, Apr 6, 2010 at 3:59 PM, David Winsemius <dwinsem...@comcast.net> wrote:
On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:

Hi Guys

I have two data frames which I would like to merge on two conditions.

I am doing the following  (abstract form)

new.data.frame <- merge(df1,df2, by=c("Col1","Col2"))


So I am guessing that you really wanted just this:

new.data.frame <- merge(df1,df2)

?merge

Since the default for merge is by = intersect(names(x), names(y)), this would have been equivalent to


new.data.frame <- merge(df1,df2, by=c("chr", "SNP") )

See above regarding the possibility that you have non-congruent SNP labeling problems.
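
A minimal sketch of both points, on made-up frames (all values are hypothetical). It shows that merge() with no `by` argument joins on the shared column names, and that keys which are labeled differently give an empty result. The str() output earlier in the thread lists chr levels such as "chr1" in one frame and "1" in the other; whether adding a "chr" prefix is the right fix for the real data is an assumption.

```r
# Toy data: df2 uses bare chromosome labels, df1 uses a "chr" prefix.
df1 <- data.frame(chr = c("chr1", "chr1", "chr2"),
                  SNP = c(100L, 101L, 500L),
                  depth = c(1L, 2L, 2L))
df2 <- data.frame(chr = c("1", "2"),
                  SNP = c(100L, 500L),
                  rsid = c("rs10", "rs20"))

# The only column names the two frames share are the two keys.
shared <- intersect(names(df1), names(df2))

# With no 'by' argument, merge() joins on those shared columns.
implicit <- merge(df1, df2)
explicit <- merge(df1, df2, by = c("chr", "SNP"))

# Both results are empty here, because "chr1" never equals "1".
nrow(implicit)
nrow(explicit)

# Assumed fix: give df2 the same prefix before merging.
df2$chr <- paste0("chr", df2$chr)
fixed <- merge(df1, df2)
nrow(fixed)
```

Comparing levels(df1$chr) with levels(df2$chr), or unique() on character columns, before merging is a cheap way to spot this kind of mismatch.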


What does

 str(df1) ; str(df2)

... show?



It is giving me a null result.

Basically I need to apply two conditions.

I also tried sqldf but it is running forever. Will indexing help ?

temp <- sqldf("select a.chr,a.SNP,a.snp_qual,a.rms_qual,a.depth,b.rsid FROM

+ data_lane6_snps a,
+ data_lane6_snps_rsid b
+ WHERE
+ a.SNP = b.SNP
+ AND
+ a.chr = b.chr
+ ")

Thanks!
-Abhi


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT

