Re: [R] merging data frames
?merge

On Jun 14, 2013, at 0:51, Yasin Gocgun entropy...@gmail.com wrote:
> I have been struggling with the issue of merging data frames that have
> common columns and have different dimensions. [...]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] merging data frames
Thanks for your responses. I have already found that the merge function does what I am looking for.

On Fri, Jun 14, 2013 at 12:51 AM, Yasin Gocgun entropy...@gmail.com wrote:
> I have been struggling with the issue of merging data frames that have
> common columns and have different dimensions. [...]

--
Yasin Gocgun
[R] merging data frames
Hi,

I have been struggling with merging data frames that share some columns but have different dimensions. Although I have searched the internet a lot, I could not find a function that performs the required operation efficiently, so I would appreciate it if anyone who knows how to solve the problem could explain the solution to me.

As you will see, the data frames below have one common column (in general they would have several), and I simply want to create a table C that is the union of A and B. The first row of C must therefore include all the information provided in A and B about the person with the respective COMPID.

A:
 COMPID CLR_DOT CA_TYPE CA_YEAR DT_FNDNG  DT_BIOP NORMAL
1030956   XXGRX       P      10 19890919 19890919      0
2511425   XXXRX       T       6 19891005 19891030      0
3205129   XXGRX       T       8 19900227 19900227      0
...

B:
COMPID CNTR_ALL ALLOC AG_GRCAL  ENRL_DT EXP_SCR
   112        0    NO        1 19800122       5
   121        0    NO        2 19800121       5
   130        0    NO        3 19800121       5
   149        0    NO        4 19800121       5
...

I tried rbind.fill but it did not work. I am aware that such operations can be done in SAS in a minute, so I thought R should be as efficient as SAS at them.

Thank you in advance,
Yasin
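The replies in this thread point to merge(). A minimal sketch of how it might be applied to frames shaped like A and B follows. The toy data keeps only a few of the columns, and the overlapping COMPID (112) is invented for illustration, since the rows shown in the post do not overlap.

```r
# Toy versions of A and B. Only some columns are kept, and COMPID 112 is
# placed in both frames on purpose (an assumption, not taken from the post).
A <- data.frame(COMPID  = c(112, 1030956, 2511425),
                CLR_DOT = c("XXGRX", "XXGRX", "XXXRX"),
                CA_TYPE = c("P", "P", "T"),
                stringsAsFactors = FALSE)
B <- data.frame(COMPID  = c(112, 121, 130),
                ALLOC   = c("NO", "NO", "NO"),
                EXP_SCR = c(5, 5, 5),
                stringsAsFactors = FALSE)

# all = TRUE keeps every COMPID found in either frame (a full outer join);
# columns that have no counterpart in the other frame are filled with NA.
C <- merge(A, B, by = "COMPID", all = TRUE)
print(C)
```

With several common columns, by can be given a character vector of names; leaving by out makes merge() use all column names the two frames share.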
Re: [R] Merging Data Frames in R
type ?merge in R

- Yasir Kaheil
[R] Merging data frames one of which is NULL
Hello!

I am running a loop. The result of each run of the loop is a data frame, and I am merging all the data frames. For example:

The data frame from run 1:
    x <- data.frame(a = 1, b = 2, c = 3)
The data frame from run 2:
    y <- data.frame(a = 10, b = 20, d = 30)
What I want to get is:
    merge(x, y, all.x = TRUE, all.y = TRUE)

Then I want to merge it with the output of the 3rd run, etc. Unfortunately, I can't create the placeholder for the overall results BEFORE I run the loop, because I don't know how many columns I'll end up with after merging all the data frames. I was thinking of creating an empty list:
    first <- NULL
...and then updating it during each run by merging it with the data frame that is the output of the run. However, when I try to merge the empty list with any non-empty data frame, the result is empty:
    merge(first, a, all.x = TRUE, all.y = TRUE)

Is there a way to make it merge while keeping everything?

Thanks a lot!

--
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com
Re: [R] Merging data frames one of which is NULL
Hi Dimitri,

I have some doubts whether storing the results of a loop in a data frame and merging it on every run is the most efficient way of doing things, but I do not know your situation. This does what you want, I believe, but I suspect it could be quite slow. I worked around the placeholder issue using an if statement.

HTH,

Josh

    for (i in 1:10) {
      x <- data.frame(a = 1, b = 2, c = i)
      if (i == 1) {
        y <- x
      } else {
        y <- merge(x, y, all.x = TRUE, all.y = TRUE)
      }
    }

On Tue, Nov 9, 2010 at 8:42 AM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:
> I am running a loop. The result of each run of the loop is a data frame.
> I am merging all the data frames. [...]

--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
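A variation on the if-statement workaround that keeps the NULL placeholder from the original question could look like the sketch below. The helper run_once() is hypothetical and only stands in for whatever each loop iteration really computes; the point is that the placeholder is tested with is.null() instead of being passed to merge().

```r
# Hypothetical stand-in for the real per-iteration computation: odd runs
# return columns a, b, c and even runs return columns a, b, d.
run_once <- function(i) {
  if (i %% 2 == 1) {
    data.frame(a = i, b = 2 * i, c = 3 * i)
  } else {
    data.frame(a = i, b = 2 * i, d = 30 * i)
  }
}

result <- NULL                       # placeholder; no columns are known yet
for (i in 1:4) {
  x <- run_once(i)
  # Only call merge() once there is something to merge with.
  result <- if (is.null(result)) x else merge(result, x, all = TRUE)
}
print(result)
```

As noted in the reply, merging inside the loop may be slow for many iterations, because the accumulated frame is rebuilt on every pass.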
Re: [R] Merging data frames one of which is NULL
Thanks a lot, Joshua. You might be right. I am thinking of creating a list (as a placeholder) and then merging the elements of the list.

Dimitri

On Tue, Nov 9, 2010 at 12:11 PM, Joshua Wiley jwiley.ps...@gmail.com wrote:
> I have some doubts whether storing the results of a loop in a data frame
> and merging it with every run is the most efficient way of doing things,
> but I do not know your situation. [...]
Re: [R] Merging data frames one of which is NULL
Dimitri -

Usually the easiest way to solve problems like this is to put all the data frames in a list, and then use the Reduce() function to merge them all together at the end. You don't give many details about how the data frames are constructed, so it's hard to be specific about the best way to put them in a list, but this short example should give you an idea of what I'm talking about:

    x <- data.frame(a = 1, b = 2, c = 3)
    y <- data.frame(a = 10, b = 20, d = 30)
    z <- data.frame(a = 12, b = 19, f = 25)
    a <- data.frame(a = 9, b = 10, g = 15)
    Reduce(function(x, y) merge(x, y, all = TRUE), list(x, y, z, a))
       a  b  c  d  f  g
    1  1  2  3 NA NA NA
    2  9 10 NA NA NA 15
    3 10 20 NA 30 NA NA
    4 12 19 NA NA 25 NA

Hope this helps.

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spec...@stat.berkeley.edu

On Tue, 9 Nov 2010, Dimitri Liakhovitski wrote:
> I am running a loop. The result of each run of the loop is a data frame.
> I am merging all the data frames. [...]
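The reply leaves open how the data frames get into the list. One possible way, sketched here under the assumption that they come out of a loop as in the original question, is to pre-allocate a list with one slot per iteration and call Reduce() once at the end. The column names v1, v2, ... and the values are made up for illustration.

```r
pieces <- vector("list", 4)          # one slot per loop iteration
for (i in seq_along(pieces)) {
  # Hypothetical per-iteration result: shared columns a and b plus one
  # column whose name depends on the iteration.
  d <- data.frame(a = i, b = 10 * i)
  d[[paste0("v", i)]] <- i^2
  pieces[[i]] <- d
}

# Fold the list pairwise with a full outer join, as in the reply above.
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), pieces)
print(merged)
```

The list does not need to know the final set of columns in advance, which was the obstacle to creating a placeholder data frame.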
Re: [R] Merging data frames one of which is NULL
Thanks a lot, Phil. I decided to do it via the list - as you suggested, but had to do some gymnastics, which Reduce will greatly help me to avoid now!

Dimitri

On Tue, Nov 9, 2010 at 12:36 PM, Phil Spector spec...@stat.berkeley.edu wrote:
> Usually the easiest way to solve problems like this is to put all the
> dataframes in a list, and then use the Reduce() function to merge them
> all together at the end. [...]
[R] Merging data frames on a variety of columns
Hello,

This is a semi-complicated question about comparing two datasets, probably using merge, but I am open to other ideas.

I have a large data frame of information about companies. It has over 30,000 rows and looks something like this:

df1 <-
identifier1 identifier2 name   other_name year
H34         C56         ACME   ACME_LTD   2001
H34         NA          ACME   ACME_LTD   2002
X20         C40         FOO_CO FOO_CO     2004
NA          NA          BAR_SA BAR_SAB    2004
NA          NA          BAR_SA BAR_SAB    2005

As you can see, many observations are missing values. I have a second data frame with information about these same companies, in fewer rows, and often with slightly different info:

df2 <-
identifier1 identifier2 name     year
H34         NA          ACME_LTD 2001
H34         NA          ACME_LTD 2002
X20         C40         FOO      2004

The idea is to figure out which companies in the first set are not in the second set. My approach so far is to do various merges and then remove the matches from the original data frame:

    m1 <- merge(df1, df2, by = c("identifier1", "identifier2", "year"), incomparables = NA)
    m2 <- merge(df1, df2, by = c("name", "year"), incomparables = NA)
    m3 <- merge(df1, df2, by.x = c("other_name", "year"), by.y = c("name", "year"), incomparables = NA)

Is this really the best way to accomplish my goal? Also, for some reason, when I do merges like m3 my resulting data frame is missing columns, and I am getting rows that do not appear to match on the variables I have specified, e.g.:

year other_name identifier1 name         identifier2
2001 AMDOCS_LTD G0260210    AMDOCS_LTDED C42913

Help is much appreciated,
Chris
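No reply to this question appears in the archive. One possible approach, offered only as a sketch, is an anti-join on a composite key built by hand, so that rows with a missing key component can never match. The helper make_key() is hypothetical, and the toy frames keep only some of the columns from the post. The help page for merge() describes incomparables as intended for merging on a single column, which may be related to the odd results reported with several by columns.

```r
df1 <- data.frame(
  identifier1 = c("H34", "H34", "X20", NA, NA),
  name        = c("ACME", "ACME", "FOO_CO", "BAR_SA", "BAR_SA"),
  year        = c(2001, 2002, 2004, 2004, 2005),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  identifier1 = c("H34", "H34", "X20"),
  year        = c(2001, 2002, 2004),
  stringsAsFactors = FALSE)

# Hypothetical helper: paste the key columns into one string per row and
# set the key to NA whenever any component is missing.
make_key <- function(d, cols) {
  key <- do.call(paste, c(d[cols], sep = "\r"))
  key[!complete.cases(d[cols])] <- NA
  key
}

cols <- c("identifier1", "year")
k1 <- make_key(df1, cols)
k2 <- make_key(df2, cols)

# Rows of df1 whose key is missing or does not occur in df2.
unmatched <- df1[is.na(k1) | !(k1 %in% k2), ]
print(unmatched)
```

Rows left in unmatched could then be retried with a different set of key columns (for example name and year), mirroring the successive merges m1, m2 and m3 in the post.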
[R] merging data frames
Hi,

is it possible to merge two data frames while preserving the row names of the bigger data frame?

I have two data frames which I would like to combine. While doing so I always lose the row names. When I try to append them, I get an error message saying that I have non-unique names, even though I used the unique command on the data frame where the duplicated entries supposedly are.

Thanks for the help,
Assa
Re: [R] merging data frames
Put the rownames as another column in your dataframe so that it remains with the data. After merging, you can then use it as the rownames.

On Mon, Jun 14, 2010 at 9:25 AM, Assa Yeroslaviz fry...@gmail.com wrote:
> is it possible to merge two data frames while preserving the row names
> of the bigger data frame? [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
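A minimal sketch of this suggestion, using made-up data; the column name rn is an arbitrary choice for illustration.

```r
big   <- data.frame(id = c("p1", "p2", "p3"), x = c(10, 20, 30),
                    row.names = c("geneA", "geneB", "geneC"),
                    stringsAsFactors = FALSE)
small <- data.frame(id = c("p3", "p1"), y = c("hi", "lo"),
                    stringsAsFactors = FALSE)

big$rn <- rownames(big)              # carry the row names along as data
m <- merge(big, small, by = "id")    # the merged frame gets fresh row names
rownames(m) <- m$rn                  # restore the original ones afterwards
m$rn <- NULL
print(m)
```

Assigning the row names back only works if the values in rn are unique after the merge. If a row of the bigger frame matches several rows of the other one, it is repeated, and R refuses duplicate row names, which would be consistent with the non-unique names error described in the question.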
Re: [R] merging data frames
If you want to keep only the rows that are unique in the first column then do the following:

    workComb1 <- subset(workComb, !duplicated(ProbeID))

On Mon, Jun 14, 2010 at 11:20 AM, Assa Yeroslaviz fry...@gmail.com wrote:
> Well, the problem is basically elsewhere. I have a data frame with
> expression data and duplicated IDs in the first column (see example).
> When I want to put them into row names I get the message that there are
> non-unique items in the data. So I tried to delete such rows with unique.
> The problem is that unique doesn't delete all of them.
>
> I compare two data frames by their Probe IDs. I would like to delete all
> duplicated lines with a certain Probe ID regardless of the rest of the
> line; that is, I would like a data frame with single, unique identifiers
> in the Probe ID column. merge doesn't give me that. It doesn't delete all
> similar lines: if the lines are not identical in the other columns, it
> leaves them in the table.
>
> Is there a way of deleting the whole line for duplicated Probe IDs?
>
>     workbook <- read.delim(file = "workbook1.txt", quote = "", sep = "\t")
>     GeneID <- read.delim(file = "testTable.txt", quote = "", sep = "\t")
>     workComb <- merge(workbook, GeneID, by.x = "ProbeID", by.y = "Probe.Id")
>     workComb1 <- unique(workComb)
>     write.table(workComb, file = "workComb.txt", sep = "\t", quote = FALSE, row.names = FALSE)
>     write.table(workComb1, file = "workComb1.txt", sep = "\t", quote = FALSE, row.names = FALSE)
>
> Look at lines 49 and 50 in the file workComb1.txt after using unique on
> the file. The lines are identical with the exception of the Transcript ID.
> I would like to take one of them out of the table.
>
> THX,
> Assa
>
> On Mon, Jun 14, 2010 at 15:33, jim holtman jholt...@gmail.com wrote:
>> Put the rownames as another column in your dataframe so that it remains
>> with the data. After merging, you can then use it as the rownames [...]

--
Jim Holtman
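The difference between unique() and !duplicated() on a single column can be seen on a small made-up frame. The column names and values below are illustrative assumptions, not the poster's data.

```r
workComb <- data.frame(
  ProbeID      = c("P1", "P1", "P2", "P2"),
  TranscriptID = c("T1", "T2", "T3", "T3"),
  value        = c(1.5, 1.5, 2.0, 2.0),
  stringsAsFactors = FALSE)

# unique() compares whole rows, so the two P1 rows (which differ in
# TranscriptID) both survive; only the exact P2 duplicate is dropped.
by_row <- unique(workComb)

# duplicated() on the ID column flags every repeat of a ProbeID, so
# negating it keeps the first row per ProbeID regardless of other columns.
by_id <- workComb[!duplicated(workComb$ProbeID), ]

print(nrow(by_row))
print(nrow(by_id))
```

Which of the duplicated rows is kept depends on row order, so sorting before the call may be worthwhile if one transcript is preferred over another.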
[R] Merging data frames on two conditions
Hi Guys,

I have two data frames which I would like to merge on two conditions. I am doing the following (abstract form):

    new.data.frame <- merge(df1, df2, by = c("Col1", "Col2"))

It is giving me a null result. Basically I need to apply two conditions. I also tried sqldf but it is running forever. Will indexing help?

    temp <- sqldf("select a.chr, a.SNP, a.snp_qual, a.rms_qual, a.depth, b.rsid
                   FROM data_lane6_snps a, data_lane6_snps_rsid b
                   WHERE a.SNP = b.SNP AND a.chr = b.chr")

Thanks!
-Abhi
Re: [R] Merging data frames on two conditions
On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:
> I have two data frames which I would like to merge on two conditions.
> I am doing the following (abstract form)
>
>     new.data.frame <- merge(df1, df2, by = c("Col1", "Col2"))

What does str(df1) ; str(df2) ... show?

> It is giving me a null result. Basically I need to apply two conditions.
> [...]

David Winsemius, MD
West Hartford, CT
Re: [R] Merging data frames on two conditions
Hi David,

Here it is. You can ignore the bio jargon if it sounds confusing. The columns (SNP, chr) on which I am applying the merge have the same data type in both frames.

    merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP", "chr"))

    str(data_lane6_snps)
    'data.frame': 7724462 obs. of 10 variables:
     $ chr           : Factor w/ 25 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ SNP           : int 100 101 103 108 179 180 191 197 218 222 ...
     $ reference     : Factor w/ 5 levels "A","C","G","N",..: 2 2 5 2 2 5 2 2 1 5 ...
     $ genotype      : Factor w/ 10 levels "A","C","G","K",..: 1 1 1 8 2 2 3 8 2 2 ...
     $ consensus_qual: int 0 0 0 4 33 33 19 19 19 19 ...
     $ snp_qual      : int 0 0 0 4 0 33 19 19 19 19 ...
     $ rms_qual      : int 0 0 0 0 21 21 21 21 21 21 ...
     $ depth         : int 1 1 1 1 2 2 2 2 2 2 ...
     $ bases         : Factor w/ 453774 levels "^!,","^!,^!,",..: 5 5 5 410998 49793 155731 284998 416878 133393 133393 ...
     $ base_quality  : Factor w/ 555104 levels "`","``","```",..: 359 359 359 54813 92856 92856 92856 92856 92539 55424 ...

    str(data_lane6_snps_rsid)
    'data.frame': 797807 obs. of 4 variables:
     $ chr : Factor w/ 24 levels "1","10","11",..: 3 3 3 3 3 3 3 3 3 3 ...
     $ SNP : int 68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
     $ end : int 68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
     $ rsid: Factor w/ 797807 levels "rs10","rs1010",..: 100229 685690 505395 470219 780326 29342 29263 327909 434159 723152 ...

On Tue, Apr 6, 2010 at 3:59 PM, David Winsemius dwinsem...@comcast.net wrote:
> What does str(df1) ; str(df2) ... show?
Re: [R] Merging data frames on two conditions
And I should also add that if I merge on only one column it works fine, but the result is not what I want.

    merge(data_lane6_snps, data_lane6_snps_rsid, by = "SNP")

works as expected. Is the chr column being a factor creating problems here?

-A

On Tue, Apr 6, 2010 at 4:03 PM, Abhishek Pratap abhishek@gmail.com wrote:
> Here it is. You can ignore the bio jargon if it sounds confusing. The
> corresponding data type of column (SNP, chr) on which I am applying merge
> is same. [...]
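The str() output posted in this thread shows chr levels such as "chr1","chr10" in one frame and "1","10","11" in the other. One possible explanation for the empty two-column merge, which the thread itself does not confirm, is that the chromosome labels never agree, so no (SNP, chr) pair can match. The sketch below uses tiny made-up frames (the rsid values are invented) to show the effect and one way of harmonising the labels.

```r
snps <- data.frame(chr   = factor(c("chr1", "chr1", "chr11")),
                   SNP   = c(100L, 101L, 68143872L),
                   depth = c(1L, 1L, 2L))
rsid <- data.frame(chr  = factor(c("11", "1")),
                   SNP  = c(68143872L, 101L),
                   rsid = c("rs100", "rs200"),
                   stringsAsFactors = FALSE)

# Before harmonising: "chr11" never equals "11", so nothing matches.
n_before <- nrow(merge(snps, rsid, by = c("SNP", "chr")))

# Strip the "chr" prefix and compare plain character labels in both frames.
snps$chr <- sub("^chr", "", as.character(snps$chr))
rsid$chr <- as.character(rsid$chr)
both <- merge(snps, rsid, by = c("SNP", "chr"))

print(n_before)
print(both)
```

Being a factor is not in itself the obstacle here, since merge() matches factors on their labels; what matters is whether the labels are written the same way in both frames.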
Re: [R] Merging data frames on two conditions
On Apr 6, 2010, at 4:03 PM, Abhishek Pratap wrote:

> Hi David
>
> Here it is. You can ignore the bio jargon if it sounds confusing. Sometimes it is essential to have domain details. The data type of the columns (SNP, chr) on which I am applying merge is the same in both data frames.
>
>     merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP","chr"))
>
> str(data_lane6_snps)
> 'data.frame': 7724462 obs. of 10 variables:
>  $ chr           : Factor w/ 25 levels chr1,chr10,..: 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP           : int 100 101 103 108 179 180 191 197 218 222 ...
>  $ reference     : Factor w/ 5 levels A,C,G,N,..: 2 2 5 2 2 5 2 2 1 5 ...
>  $ genotype      : Factor w/ 10 levels A,C,G,K,..: 1 1 1 8 2 2 3 8 2 2 ...
>  $ consensus_qual: int 0 0 0 4 33 33 19 19 19 19 ...
>  $ snp_qual      : int 0 0 0 4 0 33 19 19 19 19 ...
>  $ rms_qual      : int 0 0 0 0 21 21 21 21 21 21 ...
>  $ depth         : int 1 1 1 1 2 2 2 2 2 2 ...
>  $ bases         : Factor w/ 453774 levels ^!,,^!,^!,,..: 5 5 5 410998 49793 155731 284998 416878 133393 133393 ...
>  $ base_quality  : Factor w/ 555104 levels `,``,```,..: 359 359 359 54813 92856 92856 92856 92856 92539 55424 ...
>
> str(data_lane6_snps_rsid)
> 'data.frame': 797807 obs. of 4 variables:
>  $ chr : Factor w/ 24 levels 1,10,11,..: 3 3 3 3 3 3 3 3 3 3 ...
>  $ SNP : int 68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...

Looking at this line and the line for SNP in the data frame above, I am not seeing much similarity in range. There are also 10 times fewer observations. What was your plan for the non-matching cases? Did you really mean that you wanted a right outer join? You might get information by trying:

    length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))

That would tell you how many potential matches you might have on the basis of SNP numbers, although an SNP match might or might not be a full match, given the chr matching that is also being specified.

>  $ end : int 68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
>  $ rsid: Factor w/ 797807 levels rs10,rs1010,..: 100229 685690 505395 470219 780326 29342 29263 327909 434159 723152 ...
>
> On Tue, Apr 6, 2010 at 3:59 PM, David Winsemius wrote:
>
>> On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:
>>
>>> Hi Guys
>>>
>>> I have two data frames which I would like to merge on two conditions. I am doing the following (abstract form):
>>>
>>>     new.data.frame <- merge(df1, df2, by=c("Col1","Col2"))

So I am guessing that you really wanted just this:

    new.data.frame <- merge(df1, df2)

See ?merge. Since the default for merge is by = intersect(names(x), names(y)), this would have been equivalent to

    new.data.frame <- merge(df1, df2, by=c("chr", "SNP"))

See above regarding the possibility that you have non-congruent SNP labeling problems.

>> What does str(df1) ; str(df2) ... show?
>>
>>> It is giving me a null result. Basically I need to apply two conditions. I also tried sqldf but it is running forever. Will indexing help?
>>>
>>>     temp <- sqldf("select a.chr,a.SNP,a.snp_qual,a.rms_qual,a.depth,b.rsid FROM
>>>       data_lane6_snps a,
>>>       data_lane6_snps_rsid b
>>>       WHERE
>>>       a.SNP = b.SNP
>>>       AND
>>>       a.chr = b.chr
>>>       ")
>>>
>>> Thanks!
>>> -Abhi

David Winsemius, MD
West Hartford, CT
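The two points in this reply (the default `by` of `merge` and the `intersect` check on a key column) can be tried on a toy pair of data frames. This is only a sketch: `df1`, `df2` and all their values are invented stand-ins, not the poster's data.

```r
# Toy stand-ins for the two data frames (all values are invented).
df1 <- data.frame(chr = c("1", "1", "2"),
                  SNP = c(100L, 101L, 200L),
                  depth = c(1L, 2L, 2L),
                  stringsAsFactors = FALSE)
df2 <- data.frame(chr = c("1", "2", "3"),
                  SNP = c(100L, 200L, 500L),
                  rsid = c("rs10", "rs1010", "rs7"),
                  stringsAsFactors = FALSE)

# Columns that merge() joins on when `by` is not given.
default_keys <- intersect(names(df1), names(df2))

# With no `by`, merge() uses exactly those shared columns.
implicit <- merge(df1, df2)
explicit <- merge(df1, df2, by = c("chr", "SNP"))

# Upper bound on matches when looking at one key column in isolation.
snp_overlap <- length(intersect(df1$SNP, df2$SNP))

print(default_keys)
print(implicit)
print(snp_overlap)
```

The overlap count on a single column is only a ceiling; rows still have to agree on every key column to appear in the merged result.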
Re: [R] Merging data frames on two conditions
Hi David

I understand that, looking at the SNP values shown, they appear to be different values, which would explain the empty merge result. However, the two columns still have ~700K SNPs in common. What I am looking for is a merge where both SNP and chr match. If I match only on the SNP column I get partially correct results, since it is possible for two chromosomes to have a SNP at the same bp location, so the merge needs to take both SNP position and chromosome into account.

Thanks!
-Abhi
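The point about matching on SNP alone can be seen on a small example. This is a sketch with invented data (`a` and `b` are hypothetical): two chromosomes share the same position, so a one-key merge pairs rows across chromosomes.

```r
# Two chromosomes with a SNP at the same position (invented values).
a <- data.frame(chr = c("1", "2"), SNP = c(100L, 100L),
                depth = c(1L, 2L), stringsAsFactors = FALSE)
b <- data.frame(chr = c("1", "2"), SNP = c(100L, 100L),
                rsid = c("rs10", "rs1010"), stringsAsFactors = FALSE)

# Joining on position only pairs every row of `a` with every row of `b`
# at that position; the clashing chr columns come back as chr.x / chr.y.
by_snp <- merge(a, b, by = "SNP")

# Rows where the chromosomes disagree are spurious matches.
spurious <- sum(by_snp$chr.x != by_snp$chr.y)

# Joining on both keys keeps only the genuine matches.
by_both <- merge(a, b, by = c("chr", "SNP"))

print(by_snp)
print(by_both)
```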
Re: [R] Merging data frames on two conditions
Just so you know:

    length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
    796120

I just need to include the chr condition now, which is where I am stuck.

-Abhi
Re: [R] Merging data frames on two conditions
OK, not the SNPs. So look at the chr's. I will bet that you get 0 when you try:

    length(intersect(data_lane6_snps$chr, data_lane6_snps_rsid$chr))

... since one is using a format of chrNN and the other is using just NN. You need to get the chromosome naming convention straightened out.

-- 
David.

David Winsemius, MD
West Hartford, CT
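One way to straighten out the naming convention is to strip the "chr" prefix before merging. The sketch below uses invented data (`snps` and `rsid` are hypothetical stand-ins for the two data frames), and stripping the prefix with `sub()` is just one possible fix; the thread does not say which one was used.

```r
# Toy stand-ins: one table labels chromosomes "chrNN", the other "NN".
snps <- data.frame(chr = factor(c("chr1", "chr1", "chr2")),
                   SNP = c(100L, 101L, 100L),
                   depth = c(1L, 2L, 2L))
rsid <- data.frame(chr = factor(c("1", "2")),
                   SNP = c(100L, 100L),
                   rsid = c("rs10", "rs1010"),
                   stringsAsFactors = FALSE)

# The check from the message above: no chromosome label is shared.
overlap_before <- length(intersect(as.character(snps$chr),
                                   as.character(rsid$chr)))

# Harmonise the labels: drop a leading "chr" and compare as character.
snps$chr <- sub("^chr", "", as.character(snps$chr))
rsid$chr <- as.character(rsid$chr)

overlap_after <- length(intersect(snps$chr, rsid$chr))
merged <- merge(snps, rsid, by = c("chr", "SNP"))

print(overlap_before)
print(overlap_after)
print(merged)
```

Converting the factor to character first avoids any surprises from differing factor level sets in the two tables.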
Re: [R] Merging data frames on two conditions
You found the error: the two data frames use different naming conventions for chr. I should be able to fix that pretty easily. In case the problem persists, I will contact the list.

Thanks!
-Abhi
Re: [R] Merging data frames on two conditions
Yes, indexing will typically make a large difference.

On Tue, Apr 6, 2010 at 3:54 PM, Abhishek Pratap wrote:

> I also tried sqldf but it is running forever. Will indexing help?
Re: [R] merging data frames gives all NAs
David,

Now the code is:

    for (j in seq_along(rwy)) {   # subset the data and merge them
        ar4rw = ar4rw <- subset(arrgnd, arrgnd$Runway==rwy[j])
        if (j == 1) {
            arrw = ar4rw
        } else {
            arrw = merge(arrw, ar4rw)
        }
    }

I attach the data. I needed 500 rows to get both runways in rwy. The suggestions did not help much, but did get rid of the row of NAs in ar4rw. Why?

When I run through the loop for 2 runways, I get

# j = 1, Runway = 31L
Browse[1]> arrw[1:3,]
    DateTime      Date       month hour minute quarter weekday IATA ICAO Flight
552 1/1/09 23:03  2009-01-01 1     23   3      92      5       AA   AAL  AAL22
563 1/1/09 23:17  2009-01-01 1     23   17     93      5       DL   DAL  DAL242
565 1/1/09 23:24  2009-01-01 1     23   24     93      5       DL   DAL  DAL624
    AircraftType Tail   Arrived  STA   Runway FromTo     Delay
552 B762         N329AA 23:03:35 23:10 31L    LAX /JFK   0
563 B763         N1611B 23:17:37 23:46 31L    KATL /KJFK 0
565 B752         N654DL 23:24:04 23:48 31L    LAS /JFK   0
    Operator          dq            gw
552 AMERICAN AIRLINES 2009-01-01 92 1
563 DELTA AIR LINES   2009-01-01 93 1
565 DELTA AIR LINES   2009-01-01 93 1

# j = 2, Runway = 31R
Browse[1]> ar4rw[1:3,]
    DateTime      Date       month hour minute quarter weekday IATA ICAO Flight
529 1/1/09 21:46  2009-01-01 1     21   46     87      5       TA   TAI  TAI570
530 1/1/09 21:48  2009-01-01 1     21   48     87      5       AA   AAL  AAL2018
531 1/1/09 21:50  2009-01-01 1     21   50     87      5       BA   BAW  BAW183
    AircraftType Tail   Arrived  STA   Runway FromTo     Delay
529 A320         N496TA 21:46:58 22:30 31R    MSLP /KJFK 0
530 B752         N621AM 21:48:43 21:50 31R    TLPL /JFK  0
531 B744         G-CIVI 21:50:26 22:50 31R    EGLL /KJFK 0
    Operator                    dq            gw
529 TACA INTERNATIONAL AIRLINES 2009-01-01 87 1
530 AMERICAN AIRLINES           2009-01-01 87 1
531 BRITISH AIRWAYS             2009-01-01 87 1

# But the merge gives all NAs!
Browse[1]> arrw[1:3,]
     DateTime Date month hour minute quarter weekday IATA ICAO Flight
NA   NA       NA   NA    NA   NA     NA      NA      NA   NA   NA
NA.1 NA       NA   NA    NA   NA     NA      NA      NA   NA   NA
NA.2 NA       NA   NA    NA   NA     NA      NA      NA   NA   NA
     AircraftType Tail Arrived STA Runway FromTo Delay Operator dq gw
NA   NA           NA   NA      NA  NA     NA     NA    NA       NA NA
NA.1 NA           NA   NA      NA  NA     NA     NA    NA       NA NA
NA.2 NA           NA   NA      NA  NA     NA     NA    NA       NA NA

Thanks,
Jim Rome

On Feb 1, 2010, at 5:30 PM, David Winsemius wrote:

> On Feb 1, 2010, at 5:16 PM, James Rome wrote:
>
>> Dear kind R helpers,
>>
>> I have a vector of runway names in rwy (31R, 31L, ...; the number is user selectable). arrgnd is a data frame with data for all flights and all runways, with a Runway column. I am trying to subset arrgnd into a data frame for each selected runway, and then combine them back together using the following code:
>>
>>     for (j in 1:nr) {   # nr = number of user-selected runways
>
> Safer would be:
>
>     for (j in seq_along(rwy) {
>
>>         ar4rw = arrgnd[arrgnd$Runway==rwy[j],]
>
> Clearer would be:
>
>     ar4rw <- subset(arrgnd, Runway= j)   # and I think the NA lines will also disappear.
>                                   ^ == ^
>
>>         if (j == 1) {
>>             arrw = ar4rw
>>         } else {
>>             arrw = merge(arrw, ar4rw)
>>         }
>>     }
>
> You really should give us something like:
>
>     dput(rwy)
>     dput( head(arrgnd, 10) )
>
>> but the merge step gives me a data frame with all NAs. In addition, ar4rw always gets a row with NAs at the start, which I do not understand. There are no rows with all NAs in the arrgnd data frame.
>>
>> ar4rw[1:2,]   # first time through for 31R
>>     DateTime      Date       month hour minute quarter weekday IATA ICAO Flight
>> NA  NA            NA         NA    NA   NA     NA      NA      NA   NA   NA
>> 529 1/1/09 21:46  2009-01-01 1     21   46     87      5       TA   TAI  TAI570
>>     AircraftType Tail   Arrived  STA   Runway FromTo     Delay
>> NA  NA           NA     NA       NA    NA     NA         NA
>> 529 A320         N496TA 21:46:58 22:30 31R    MSLP /KJFK 0
>>     Operator                    dq            gw
>> NA  NA                          NA            NA
>> 529 TACA INTERNATIONAL AIRLINES 2009-01-01 87 1
>>
>> ar4rw[1:2,]   # second time through for 31L
>>     DateTime      Date       month hour minute quarter weekday IATA ICAO Flight
>> NA  NA            NA         NA    NA   NA     NA      NA      NA   NA   NA
>> 552 1/1/09 23:03  2009-01-01 1     23   3      92      5       AA   AAL  AAL22
>>     AircraftType Tail   Arrived  STA   Runway FromTo   Delay Operator
>> NA  NA           NA     NA       NA    NA     NA       NA    NA
>> 552 B762         N329AA 23:03:35 23:10 31L    LAX /JFK 0     AMERICAN AIRLINES
>>     dq            gw
>> NA  NA            NA
>> 552 2009-01-01 92 1
>>
>> But after the merge, I get all NAs. What am I doing wrong?
>
> The data layout gets mangled and I cannot tell what rows are being matched to what. Use dput to convey an unambiguous and easily replicated example.
>
>> Thanks,
>> Jim Rome
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
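Both symptoms in this message (the stray NA row from `[` indexing and the all-NA result after `merge`) can be reproduced on a toy data frame. This is a sketch of one plausible explanation, assuming some `Runway` values in the real data are NA; the thread does not confirm that. `flights` and its values are invented.

```r
# Toy flight table with one missing Runway value (invented data).
flights <- data.frame(Flight = c("AAL22", "TAI570", "DAL242", "XXX1"),
                      Runway = c("31L", "31R", "31L", NA),
                      stringsAsFactors = FALSE)

# Logical indexing with `[`: an NA in the condition yields an all-NA row.
by_bracket <- flights[flights$Runway == "31L", ]

# subset() treats NA in the condition as FALSE, so no NA row appears.
by_subset <- subset(flights, Runway == "31L")

# With no `by`, merge() joins on every shared column. The two subsets share
# all columns but no identical rows, so the inner join is empty.
other <- subset(flights, Runway == "31R")
empty <- merge(by_subset, other)

# Asking for rows 1:3 of a zero-row data frame prints rows of NA.
na_rows <- empty[1:3, ]

print(by_bracket)
print(by_subset)
print(nrow(empty))
print(na_rows)
```

So the NAs after the merge need not mean that values were lost; they can simply be what indexing past the end of an empty result looks like.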
Re: [R] merging data frames gives all NAs
I figured this out finally. I really believe that the R help write-ups are sorely lacking. As soon as I looked at http://www.statmethods.net/management/merging.html, it was obvious:

Adding Columns

To merge two dataframes (datasets) horizontally, use the merge function. In most cases, you join two dataframes by one or more common key variables (i.e., an inner join).

    # merge two dataframes by ID
    total <- merge(dataframeA, dataframeB, by="ID")

    # merge two dataframes by ID and Country
    total <- merge(dataframeA, dataframeB, by=c("ID","Country"))

Adding Rows

To join two dataframes (datasets) vertically, use the rbind function. The two dataframes must have the same variables, but they do not have to be in the same order.

    total <- rbind(dataframeA, dataframeB)

I needed to add rows, and had to use rbind. If the help for merge said "To merge two dataframes (datasets) horizontally" I would have known right away that it was the wrong function to use.

Thanks for the help,
Jim Rome
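The row-stacking approach described above can be sketched as follows, again on invented data (`flights` and `rwy` are hypothetical stand-ins for `arrgnd` and the runway vector). The second form, using `%in%`, is an alternative that avoids the loop altogether; it is not something proposed in the thread.

```r
# Toy stand-ins for arrgnd and the user-selected runways (invented data).
flights <- data.frame(Flight = c("AAL22", "TAI570", "DAL242", "BAW183", "JBU1"),
                      Runway = c("31L", "31R", "31L", "31R", "22L"),
                      stringsAsFactors = FALSE)
rwy <- c("31R", "31L")

# One subset per selected runway, then stack the pieces vertically.
pieces <- lapply(rwy, function(r) subset(flights, Runway == r))
arrw <- do.call(rbind, pieces)

# Same rows in a single step, without a loop or rbind.
arrw_direct <- flights[flights$Runway %in% rwy, ]

print(arrw)
print(arrw_direct)
```

The two results contain the same flights; they differ only in row order, since the stacked version is grouped by runway.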
Re: [R] merging data frames gives all NAs
James Rome wrote:

> On 2/1/2010 5:51 PM, David Winsemius wrote:
>
> I figured this out finally. I really believe that the R help write-ups are sorely lacking.

The help docs are probably not the best way to learn R, but they are great for users of the functions. I have found that after going through an introductory book on R or an online tutorial (plus experience), the help system in R is really, really good at *documenting the behavior of the functions*, which is the point of it.

As another general hint from someone who has learned R slowly over time: when something happens that you don't understand on real data, construct a minimal example data.frame and try out your code on that. Also, learning how to use browser() or the debug package has been very useful.
Re: [R] merging data frames gives all NAs
I agree. I have a foot of books on R now, for example the R Book by Michael Crowly. But so far, Googling the archives of this list has been the most help. Nonetheless, if I cannot understand the documentation of a function, then the documentation needs to be updated. For example, there needs to be a Returns section at the top of every function, so one can see what type of thing the function returns. Merge() needs to start with To merge two dataframes (datasets) horizontally, use the *merge* function. rather than Merge two data frames by common columns or row names, or do other versions of database /join/ operations which does not at all say that it does a horizontal merge if one does not know SQL. I do know SQL, and it is still not clear to me. And the/ merge/ documentation should then refer users to/ rbind/ for vertical merges. I hope that someone on the list can take actually change this file for the benefit of others. Thanks, Jim On 2/2/2010 2:00 PM, Erik Iverson wrote: James Rome wrote: On 2/1/2010 5:51 PM, David Winsemius wrote: I figured this out finally. I really believe that the R help write-ups are sorely lacking. The help docs are probably not the best way to learn R, but they are great for users of the functions. I have found that after going through an introduction book on R or online tutorial (plus experience), that the help system in R is really, really good at *documenting the behavior of the functions*, which is the point of them. As another general hint from someone who has learned R slowly over time, when something happens that you don't understand on real data, construct a minimal example data.frame and try out your code on that. Also, learning how to use browser() or the debug package has been very useful. 
Re: [R] merging data frames gives all NAs
Yeah, sometimes the vocabulary we bring to a task does not match up (or merge properly) with the vocabulary that the developers use. In this case the merge operation is one that has a precise meaning in database lingo, in which you apparently do not have a background. My experience in trying to append objects ran into similar frustrations early in my R endeavors. For the life of me, I could not find any instances of append in the index of the references I was using.

I am glad that you found that material helpful, but I think its use of the terms join and merge is incorrect in a database framework as well, so I do not think it could be used as an unambiguous guide. Your use of combine was likewise ambiguous. In composing questions to R-help, it is advised that you post a small example and illustrate what you want to see as a result.

-- 
David.

On Feb 2, 2010, at 1:47 PM, James Rome wrote:

  On 2/1/2010 5:51 PM, David Winsemius wrote:

  I figured this out finally. I really believe that the R help write-ups are sorely lacking.

  You should ponder whether you actually know enough to criticize the help page when it describes the merge function as performing database join operations. My guess is that you don't. The help pages are not designed to teach basic computer programming concepts.

  As soon as I looked at http://www.statmethods.net/management/merging.html , it was obvious:

  Adding Columns

  To merge two dataframes (datasets) horizontally, use the merge function. In most cases, you join two dataframes by one or more common key variables (i.e., an inner join).

      # merge two dataframes by ID
      total <- merge(dataframeA, dataframeB, by="ID")
      # merge two dataframes by ID and Country
      total <- merge(dataframeA, dataframeB, by=c("ID","Country"))

  Adding Rows

  To join two dataframes (datasets) vertically, use the rbind function. The two dataframes must have the same variables, but they do not have to be in the same order.

      total <- rbind(dataframeA, dataframeB)

  I needed to add rows, and had to use rbind. If the help for merge said "To merge two dataframes (datasets) horizontally", I would have known right away that it was the wrong function to use.

  Thanks for the help,
  Jim Rome

  On Feb 1, 2010, at 5:30 PM, David Winsemius wrote:

    On Feb 1, 2010, at 5:16 PM, James Rome wrote:

    but, the merge step gives me a data frame with all NAs. In addition, ar4rw always gets a row with NAs at the start, which I do not understand. What am I doing wrong?

    Clearer would be:

        ar4rw <- subset(arrgnd, Runway == rwy[j])  # and I think the NA lines will also disappear.

    The data layout gets mangled and I cannot tell what rows are being matched to what. Use dput to convey an unambiguous and easily replicated example.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
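The horizontal/vertical distinction argued over in this exchange can be sketched on toy data. The frames `left`, `right` and `more` below are hypothetical and chosen only to make the row and column counts easy to check; this is an illustration, not code from the thread.

```r
# Hypothetical toy frames (names are illustrative only).
left  <- data.frame(ID = 1:3, height = c(170, 165, 180))
right <- data.frame(ID = c(1L, 3L), weight = c(70, 82))

# Horizontal: rows are matched on the key and columns are added.
wide     <- merge(left, right, by = "ID")                # inner join
wide_all <- merge(left, right, by = "ID", all.x = TRUE)  # keep unmatched rows of left

# Vertical: same columns, more rows.
more <- data.frame(ID = 4:5, height = c(175, 160))
tall <- rbind(left, more)

print(wide)
print(wide_all)
print(tall)
```

In database terms, merge() with its defaults is an inner join, all.x = TRUE makes it a left join, and rbind() corresponds to appending rows, which is what the original poster needed.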
[R] merging data frames gives all NAs
Dear kind R helpers,

I have a vector of runway names in rwy (31R, 31L, ...; the number is user selectable). arrgnd is a data frame with data for all flights and all runways, with a Runway column. I am trying to subset arrgnd into a data frame for each selected runway, and then combine them back together using the following code:

    for (j in 1:nr) {  # nr = number of user-selected runways
        ar4rw = arrgnd[arrgnd$Runway==rwy[j],]
        if (j == 1) {
            arrw = ar4rw
        } else {
            arrw = merge(arrw, ar4rw)
        }
    }

but the merge step gives me a data frame with all NAs. In addition, ar4rw always gets a row with NAs at the start, which I do not understand. There are no rows with all NAs in the arrgnd data frame.

    ar4rw[1:2,]  # first time through for 31R
    DateTime Date month hour minute quarter weekday IATA ICAO Flight
    NA NA NA NA NA NA NA NA NA NA NA
    529 1/1/09 21:46 2009-01-01 1 21 46 87 5 TA TAI TAI570
    AircraftType Tail Arrived STA Runway FromTo Delay
    NA NA NA NA NA NA NA NA
    529 A320 N496TA 21:46:58 22:30 31R MSLP /KJFK 0
    Operator dq gw
    NA NA NA NA
    529 TACA INTERNATIONAL AIRLINES 2009-01-01 87 1

    ar4rw[1:2,]  # second time through for 31L
    DateTime Date month hour minute quarter weekday IATA ICAO Flight
    NA NA NA NA NA NA NA NA NA NA NA
    552 1/1/09 23:03 2009-01-01 1 23 3 92 5 AA AAL AAL22
    AircraftType Tail Arrived STA Runway FromTo Delay Operator
    NA NA NA NA NA NA NA NA NA
    552 B762 N329AA 23:03:35 23:10 31L LAX /JFK 0 AMERICAN AIRLINES
    dq gw
    NA NA NA
    552 2009-01-01 92 1

But after the merge, I get all NAs. What am I doing wrong?

Thanks,
Jim Rome
Re: [R] merging data frames gives all NAs
On Feb 1, 2010, at 5:16 PM, James Rome wrote:

  Dear kind R helpers, I have a vector of runway names in rwy (31R, 31L, ...; the number is user selectable). arrgnd is a data frame with data for all flights and all runways, with a Runway column. I am trying to subset arrgnd into a data frame for each selected runway, and then combine them back together using the following code:

      for (j in 1:nr) {  # nr = number of user-selected runways

Safer would be:

    for (j in seq_along(rwy)) {

          ar4rw = arrgnd[arrgnd$Runway==rwy[j],]

Clearer would be:

    ar4rw <- subset(arrgnd, Runway == rwy[j])  # and I think the NA lines will also disappear.

          if (j == 1) {
              arrw = ar4rw
          } else {
              arrw = merge(arrw, ar4rw)
          }
      }

You really should give us something like:

    dput(rwy)
    dput( head(arrgnd, 10) )

  but, the merge step gives me a data frame with all NAs. In addition, ar4rw always gets a row with NAs at the start, which I do not understand. There are no rows with all NAs in the arrgnd data frame.

  But after the merge, I get all NAs. What am I doing wrong?

The data layout gets mangled and I cannot tell what rows are being matched to what. Use dput to convey an unambiguous and easily replicated example.

  Thanks,
  Jim Rome

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
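A likely source of the leading all-NA row is a missing value in the Runway column: logical indexing with NA returns a row of NAs, whereas subset() treats NA as FALSE. The sketch below uses a made-up stand-in for arrgnd, since the real data were never posted, so the column names and values are assumptions. On my reading, merge() on two such subsets then joins on every shared column and matches the NA keys with each other, which would leave only the all-NA rows; that final step is printed here but not asserted.

```r
# Hypothetical stand-in for arrgnd; the real data were never posted.
arrgnd <- data.frame(
  Flight = c("TAI570", "AAL22", "XXX1", "DAL9"),
  Runway = c("31R", "31L", NA, "31R"),
  stringsAsFactors = FALSE
)
rwy <- c("31R", "31L")

# Logical indexing: the NA in Runway produces an all-NA row.
with_na <- arrgnd[arrgnd$Runway == rwy[1], ]
print(with_na)

# subset() and which() both drop rows where the comparison is NA.
clean1 <- subset(arrgnd, Runway == rwy[1])
clean2 <- arrgnd[which(arrgnd$Runway == rwy[1]), ]

# What the original loop effectively did with two such subsets.
print(merge(with_na, arrgnd[arrgnd$Runway == rwy[2], ]))

# Stacking per-runway subsets is a job for rbind(), not merge().
pieces <- lapply(rwy, function(r) subset(arrgnd, Runway == r))
arrw <- do.call(rbind, pieces)

# The same rows in one step, without a loop.
arrw2 <- arrgnd[arrgnd$Runway %in% rwy, ]
print(arrw2)
```

The %in% form is usually the simplest choice when the goal is just "keep the rows for these runways", because NA never matches and no recombination step is needed.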
[R] merging data frames with matrix objects when missing cases
Hi,

I have run into a problem with the merge() function when trying to merge two data frames that have a common index, where the second one does not have cases for all indexes in the first one. With ordinary variables R fills in the missing cases with NA if all=T is requested. But if the variable is a matrix, R seems to insert NA only into the first column of the matrix and to fill in the rest of the columns by recycling the values. Here is a toy example:

    df1 <- data.frame(a=1:3, X1=I(matrix(1:6,ncol=2)))
    df2 <- data.frame(a=1:2, X2=I(matrix(11:14,ncol=2)))
    merge(df1,df2)
      a X1.1 X1.2 X2.1 X2.2
    1 1    1    4   11   13
    2 2    2    5   12   14
    # no all=T, missing cases are dropped
    merge(df1,df2,all=T)
      a X1.1 X1.2 X2.1 X2.2
    1 1    1    4   11   13
    2 2    2    5   12   14
    3 3    3    6   NA   13
    # X2.1 set to NA correctly but X2.2 set to 13 by recycling.

Can I somehow get the behaviour that the third row of the second matrix X2 in the above example would be filled with NA for all columns? None of the merge() options seems to provide a solution.

regards,
Kari
Re: [R] merging data frames with matrix objects when missing cases
This has something to do with your data.frame structure; see

    str(df1)
    'data.frame': 3 obs. of 2 variables:
     $ a : int 1 2 3
     $ X1: 'AsIs' int [1:3, 1:2] 1 2 3 4 5 6
    str(df2)
    'data.frame': 2 obs. of 2 variables:
     $ a : int 1 2
     $ X2: 'AsIs' int [1:2, 1:2] 11 12 13 14

This seems to work:

    df1 <- data.frame(a=1:3, b = 1:3, c = 4:6)
    str(df1)
    'data.frame': 3 obs. of 3 variables:
     $ a: int 1 2 3
     $ b: int 1 2 3
     $ c: int 4 5 6
    df2 <- data.frame(a=1:2, d = 11:12, e = 13:14)
    str(df2)
    'data.frame': 2 obs. of 3 variables:
     $ a: int 1 2
     $ d: int 11 12
     $ e: int 13 14
    merge(df1,df2)
      a b c  d  e
    1 1 1 4 11 13
    2 2 2 5 12 14
    merge(df1, df2, all=T)
      a b c  d  e
    1 1 1 4 11 13
    2 2 2 5 12 14
    3 3 3 6 NA NA

2009/9/18 Kari Ruohonen kari.ruoho...@utu.fi:

  With usual variables R fills in the missing cases with NA if all=T is requested. But if the variable is a matrix R seems to insert NA only to the first column of the matrix and fill in the rest of the columns by recycling the values.
Re: [R] merging data frames with matrix objects when missing cases
Yes, that was the original question: when a variable in a data frame is a matrix instead of an ordinary variable, merge() handles the missing cases so that only the first column of the matrix gets NA and the rest are recycled. If the matrix is broken up into several variables, everything works fine.

Why then have a matrix in a data frame as a variable? In chemometrics, for example, it is usual to have e.g. NIR spectra stored in the data frame in this way. This eases the use of such spectra as a predictor in the model formula (a spectrum may contain hundreds of variables, depending on the wavelength binning used). It is also helpful for grouping the variables in a data frame into different predictor sets. See the examples in the pls package.

There is a workaround: search for the NAs in the first column and set all other columns on those rows to NA as well. But my question was meant more as a caution about unexpected behaviour that someone could consider an unwanted feature.

Kari

On Fri, 2009-09-18 at 20:41 +0300, johannes rara wrote:

  This has something to do with your data.frame structure
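Two ways around the behaviour reported in this thread can be sketched without relying on what merge() does to matrix columns in any particular R version. The first is the workaround Kari describes, applied to a hand-built copy of the reported result; the second is an alternative not mentioned in the thread, which aligns the matrix rows with match() instead of merge(). The object names are illustrative assumptions.

```r
# (1) Kari's workaround, on a hand-made copy of the reported X2 result:
# wherever the first column is NA, blank the whole row.
reported_X2 <- matrix(c(11, 12, NA, 13, 14, 13), ncol = 2)
reported_X2[is.na(reported_X2[, 1]), ] <- NA
print(reported_X2)

# (2) Alternative sketch: keep keys and matrices separate and align by key.
a1 <- 1:3
X1 <- matrix(1:6, ncol = 2)
a2 <- 1:2
X2 <- matrix(11:14, ncol = 2)

# Unmatched keys give NA in idx, and indexing a matrix row
# with NA returns NA in every column of that row.
idx <- match(a1, a2)
out <- data.frame(a = a1)
out$X1 <- X1
out$X2 <- X2[idx, , drop = FALSE]
print(out)
```

The match() version only covers the "keep every row of the first table" case (a left join on a single key); a full outer join would need the union of the keys first.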
Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's
You are exceeding your maximum memory here, so R will not be able to do that. Dump both tables into a database such as MySQL and then run the query either from RMySQL or from MySQL directly. Then output the result and import it back into R.

That will take care of the merge, but I am not sure what will happen when you actually try to run some stats on the object. It is very likely that the operation will exceed memory again. In the end you may have to write your own code that does not attempt to load everything into memory; it could be in either R or a lower-level language. If you have SAS, it will probably work, as it deals well with large data sets in long format. Depending on what you do, R may be able to deal with it after a reshape() to a wide format.

joe1985 wrote:

  I want these two data frames merged by 'Marker', but when i try

      SNP5 <- merge(SNP4, SNP1, by = 'Marker', all = TRUE)
      Error: cannot allocate vector of size 2.4 Gb

  And error occurs.
[R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's
Hello,

I have two data frames, SNP4 and SNP1:

    head(SNP4)
            Animal Marker        Y
    3213 194073197  P1001 0.021088
    1295 194073197  P1002 0.021088
    915  194073197  P1004 0.021088
    2833 194073197  P1005 0.021088
    1487 194073197  P1006 0.021088
    1885 194073197  P1007 0.021088
    head(SNP1)
            Animal Marker x
    3213 194073197  P1001 2
    1295 194073197  P1002 1
    915  194073197  P1004 2
    2833 194073197  P1005 0
    1487 194073197  P1006 2
    1885 194073197  P1007 0

I want these two data frames merged by 'Marker', but when I try

    SNP5 <- merge(SNP4, SNP1, by = 'Marker', all = TRUE)
    Error: cannot allocate vector of size 2.4 Gb
    In addition: Warning messages:
    1: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
      Reached total allocation of 1535Mb: see help(memory.size)
    2: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
      Reached total allocation of 1535Mb: see help(memory.size)
    3: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
      Reached total allocation of 1535Mb: see help(memory.size)
    4: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
      Reached total allocation of 1535Mb: see help(memory.size)

an error occurs. What I want is the column SNP1$x merged together with SNP4 by Marker, so some markers will have NAs in the 'x' column in the SNP5 dataset. I also tried this:

    SNP5 <- merge(SNP4, SNP1$x, by.x = 'Marker', by.y = 'Marker', all = TRUE)
    Error in fix.by(by.y, y) : 'by' must specify valid column(s)

It won't work either. Does anyone have any idea how to solve this?

Regards,
Johannes.
Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's
Try this (where SNP1x is the same as SNP1 from your post but without the last line). If the merge below does not work on the real data set due to its size, then try the sqldf alternative.

    SNP1x <-
    structure(list(Animal = c(194073197L, 194073197L, 194073197L,
    194073197L, 194073197L), Marker = structure(1:5, .Label = c("P1001",
    "P1002", "P1004", "P1005", "P1006", "P1007"), class = "factor"),
        x = c(2L, 1L, 2L, 0L, 2L)), .Names = c("Animal", "Marker",
    "x"), row.names = c("3213", "1295", "915", "2833", "1487"), class = "data.frame")

    SNP4 <-
    structure(list(Animal = c(194073197L, 194073197L, 194073197L,
    194073197L, 194073197L, 194073197L), Marker = structure(1:6, .Label = c("P1001",
    "P1002", "P1004", "P1005", "P1006", "P1007"), class = "factor"),
        Y = c(0.021088, 0.021088, 0.021088, 0.021088, 0.021088, 0.021088
        )), .Names = c("Animal", "Marker", "Y"), class = "data.frame", row.names = c("3213",
    "1295", "915", "2833", "1487", "1885"))

    merge(SNP1x, SNP4, all = TRUE)
         Animal Marker  x        Y
    1 194073197  P1001  2 0.021088
    2 194073197  P1002  1 0.021088
    3 194073197  P1004  2 0.021088
    4 194073197  P1005  0 0.021088
    5 194073197  P1006  2 0.021088
    6 194073197  P1007 NA 0.021088

    library(sqldf)
    sqldf("select * from SNP4 left join SNP1x using (Animal, Marker)")
         Animal Marker        Y  x
    1 194073197  P1001 0.021088  2
    2 194073197  P1002 0.021088  1
    3 194073197  P1004 0.021088  2
    4 194073197  P1005 0.021088  0
    5 194073197  P1006 0.021088  2
    6 194073197  P1007 0.021088 NA

    # or, if that does not work due to size, force it to create, use
    # and destroy an external database
    sqldf("select * from SNP4 left join SNP1x using (Animal, Marker)", dbname = "temp.db")
         Animal Marker        Y  x
    1 194073197  P1001 0.021088  2
    2 194073197  P1002 0.021088  1
    3 194073197  P1004 0.021088  2
    4 194073197  P1005 0.021088  0
    5 194073197  P1006 0.021088  2
    6 194073197  P1007 0.021088 NA

On Wed, Apr 22, 2009 at 5:22 AM, Johannes G. Madsen j...@dansksvineproduktion.dk wrote:

  What i want is the column SNP1$x merged together with SNP4 by Marker, so some markers will have NA's in the 'x'-column in the SNP5 dataset.
Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's
On Apr 22, 2009, at 5:22 AM, Johannes G. Madsen wrote:

  I want these two data frames merged by 'Marker', but when i try

      SNP5 <- merge(SNP4, SNP1, by = 'Marker', all = TRUE)
      Error: cannot allocate vector of size 2.4 Gb

  And error occurs.

So what are the results of:

    str(SNP4); str(SNP1)  # this will tell us how large these objects are

And are you sure you don't want the merge to occur by Animal as well?

  What i want is the column SNP1$x merged together with SNP4 by Marker, so some markers will have NA's in the 'x'-column in the SNP5 dataset. I also tried this

      SNP5 <- merge(SNP4, SNP1$x, by.x = 'Marker', by.y = 'Marker', all = TRUE)
      Error in fix.by(by.y, y) : 'by' must specify valid column(s)

  I won't work either. Does anyone have any idea how to solve this.

The second error seems pretty obvious. You are trying to merge a vector, which no longer has any Marker, with a data frame that does.

  Regards, Johannes.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
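The question about merging by Animal as well matters for memory: when the key is not unique in either table, merge() returns every pairing of matching rows. The toy tables below are hypothetical (3 animals by 4 markers each, with made-up values) and only illustrate how the row count grows; they are not the poster's data.

```r
# Hypothetical toy data: 3 animals x 4 markers in each table.
animals <- 1:3
markers <- c("P1001", "P1002", "P1004", "P1005")

SNP4toy <- expand.grid(Animal = animals, Marker = markers,
                       stringsAsFactors = FALSE)
SNP4toy$Y <- 0.021088

SNP1toy <- expand.grid(Animal = animals, Marker = markers,
                       stringsAsFactors = FALSE)
SNP1toy$x <- rep(c(2L, 1L, 0L), length.out = nrow(SNP1toy))

# Key = Marker only: each marker occurs 3 times in both tables,
# so every marker contributes 3 * 3 rows.
by_marker <- merge(SNP4toy, SNP1toy, by = "Marker", all = TRUE)

# Key = Animal + Marker: one match per row.
by_both <- merge(SNP4toy, SNP1toy, by = c("Animal", "Marker"), all = TRUE)

print(nrow(by_marker))   # 4 markers * 9 pairings
print(nrow(by_both))     # same as the inputs
print(names(by_marker))  # Animal appears twice, with .x and .y suffixes
```

With thousands of animals per marker the same effect multiplies the row count by the number of animals, which would be consistent with the allocation error reported above.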
Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's
Hi,

How about this:

    SNP5 <- merge(SNP4, SNP1[,2:3], all.x=TRUE)
    SNP5
      Marker    Animal        Y x
    1  P1001 194073197 0.021088 2
    2  P1002 194073197 0.021088 1
    3  P1004 194073197 0.021088 2
    4  P1005 194073197 0.021088 0
    5  P1006 194073197 0.021088 2
    6  P1007 194073197 0.021088 0

This ignores Animal, and that may or may not be what you want; it wasn't clear from your question. But your error is due to memory limitations. That could come from specifying the wrong merge, or from having files larger than your computer can handle. This is a good job for a proper database.

On Wed, Apr 22, 2009 at 3:30 AM, joe1985 johan...@dsr.life.ku.dk wrote:

      SNP5 <- merge(SNP4, SNP1$x, by.x = 'Marker', by.y = 'Marker', all = TRUE)
      Error in fix.by(by.y, y) : 'by' must specify valid column(s)

If you just include SNP1$x, there is no Marker column to merge on. You need to include at least two columns.

-- 
Sarah Goslee
http://www.functionaldiversity.org
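The point about SNP1$x can be sketched on toy data: `$` returns a bare vector with no key, while selecting columns by name keeps the key next to the value. The frames below are hypothetical and hold a single animal, so Marker alone identifies a row; that is an assumption of the sketch, not something established about the real data.

```r
# Hypothetical single-animal toy tables.
SNP4toy <- data.frame(Animal = 194073197,
                      Marker = c("P1001", "P1002", "P1007"),
                      Y = 0.021088,
                      stringsAsFactors = FALSE)
SNP1toy <- data.frame(Animal = 194073197,
                      Marker = c("P1001", "P1002"),
                      x = c(2L, 1L),
                      stringsAsFactors = FALSE)

# `$` drops the data frame structure, and with it the key.
print(is.data.frame(SNP1toy$x))

# Selecting by name keeps Marker and x together.
SNP5toy <- merge(SNP4toy, SNP1toy[, c("Marker", "x")],
                 by = "Marker", all.x = TRUE)
print(SNP5toy)
```

Selecting columns by name rather than by position (2:3) keeps the call correct if the column order of the input ever changes.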
Re: [R] Merging data frames of different length
Thanks a lot, Gabor - it's perfect! Dimitri On Fri, Dec 19, 2008 at 6:24 PM, Gabor Grothendieck ggrothendi...@gmail.com wrote: Try this: L - list(data.frame(A=2, B=3, C=4), + data.frame(A=2, B=1, C=3, D=2, E=4, F=5), + data.frame(A=1, B=2, C=4, D=3, E=2, F=4, G=5, H=4, I=2)) library(plyr) do.call(rbind.fill, L) A B C D E F G H I 1 2 3 4 NA NA NA NA NA NA 2 2 1 3 2 4 5 NA NA NA 3 1 2 4 3 2 4 5 4 2 On Fri, Dec 19, 2008 at 5:48 PM, Dimitri Liakhovitski ld7...@gmail.com wrote: Hello, everyone! I have list L that contains 99 data frames. All data frames have only one row, but a different number of columns. Some data frames have 3 columns, some - 6 columns, and some - 9 columns. The names of the first 3 columns are identical in all 99 data frames (e.g., A, B, and C). The names of columns 4:6 are identical in data frames that contain 6 and 9 columns (e.g., D, E, and F). So that L looks like this: L[[1]] A B C 2 3 4 L[[2]] A B C D E F 2 1 3 2 4 5 L[[3]] A B C D E F G H I 1 2 4 3 2 4 5 4 2 L[[4]] ... How can I merge all of those data frames into one large data frame - with 99 rows - such that all data are in the columns with correct names. Of course, I'd like the rows of the new large data frame that contain the data for less than 9 columns to have NAs in columns 4:9 (or 7:9). In other words, I want the first 3 rows of the new large data frame to look like this: A B C D E F G H I 2 3 4 NA NA NA NA NA NA 2 1 3 2 4 5 NA NA NA 1 2 4 3 2 4 5 4 2 Ideally, I'd like this merge to work for ANY number of individual small data frames in L - even if their total number within L is unknown. I tried merge - but it seems to me that it only works for 2 data frames, not for many. Thank you very much! -- Dimitri Liakhovitski MarketTools, Inc. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
--
Dimitri Liakhovitski
MarketTools, Inc.
dimitri.liakhovit...@markettools.com
[R] Merging data frames of different length
Hello, everyone!

I have a list L that contains 99 data frames. All data frames have only one row, but a different number of columns: some have 3 columns, some 6, and some 9. The names of the first 3 columns are identical in all 99 data frames (e.g., A, B, and C). The names of columns 4:6 are identical in the data frames that contain 6 or 9 columns (e.g., D, E, and F). So L looks like this:

L[[1]]
  A B C
  2 3 4

L[[2]]
  A B C D E F
  2 1 3 2 4 5

L[[3]]
  A B C D E F G H I
  1 2 4 3 2 4 5 4 2

L[[4]]
  ...

How can I merge all of those data frames into one large data frame, with 99 rows, such that all data are in the columns with the correct names? Of course, I'd like the rows of the new large data frame that contain data for fewer than 9 columns to have NAs in columns 4:9 (or 7:9). In other words, I want the first 3 rows of the new large data frame to look like this:

  A B C  D  E  F  G  H  I
  2 3 4 NA NA NA NA NA NA
  2 1 3  2  4  5 NA NA NA
  1 2 4  3  2  4  5  4  2

Ideally, I'd like this merge to work for ANY number of individual small data frames in L, even if their total number within L is unknown. I tried merge, but it seems to me that it only works for 2 data frames, not for many.

Thank you very much!

--
Dimitri Liakhovitski
MarketTools, Inc.
Re: [R] Merging data frames of different length
Try this:

> L <- list(data.frame(A=2, B=3, C=4),
+           data.frame(A=2, B=1, C=3, D=2, E=4, F=5),
+           data.frame(A=1, B=2, C=4, D=3, E=2, F=4, G=5, H=4, I=2))
> library(plyr)
> do.call(rbind.fill, L)
  A B C  D  E  F  G  H  I
1 2 3 4 NA NA NA NA NA NA
2 2 1 3  2  4  5 NA NA NA
3 1 2 4  3  2  4  5  4  2

On Fri, Dec 19, 2008 at 5:48 PM, Dimitri Liakhovitski ld7...@gmail.com wrote:
> How can I merge all of those data frames into one large data frame - with 99 rows - such that all data are in the columns with correct names.
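For readers who cannot install plyr, the same padding-and-stacking idea can be sketched in base R. This is only an illustration of what the call above achieves, not plyr's actual implementation; the helper name rbind_fill_base is hypothetical and was made up for this example. It collects the union of all column names, adds any missing column to each data frame as NA, and then stacks the frames with rbind.

```r
# Hypothetical helper (not part of plyr or base R): stack data frames
# that have different columns, padding absent columns with NA.
rbind_fill_base <- function(dfs) {
  # Union of column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    # Add every column this data frame lacks, filled with NA
    for (nm in setdiff(all_cols, names(d))) {
      d[[nm]] <- NA
    }
    # Put the columns into one common order so rbind lines them up
    d[all_cols]
  })
  do.call(rbind, padded)
}

L <- list(data.frame(A=2, B=3, C=4),
          data.frame(A=2, B=1, C=3, D=2, E=4, F=5),
          data.frame(A=1, B=2, C=4, D=3, E=2, F=4, G=5, H=4, I=2))

res <- rbind_fill_base(L)
print(res)
```

Because the helper takes a list, it works for any number of data frames, which was the requirement in the question. In more recent R setups, dplyr::bind_rows(L) and data.table::rbindlist(L, fill = TRUE) cover the same use case.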
[R] merging data frames
Dear group,

I have 3 different data frames. I want to merge all 3 data frames where they intersect. Say DF1 and DF2 have 100 common elements in column 1. DF3 does not have many elements in common with either DF1 or DF2. For names in column 1 that are not present in DF3, I want to introduce NA.

DF1:
Name  Age
A     21
B     45
C     30

DF2:
Name  Age
A     50
B     20
X     10

DF3:
Name  Age
B     40
Y     21
K     30

I want to merge all 3 into one:

Df4:
Name.1  Age.1  Age.2  Age.3
A       21     50     NA
B       45     20     40
C       30     NA     NA
K       NA     NA     30
X       NA     10     NA
Y       NA     NA     21

Could anyone help me with how to merge 3 data frames? I appreciate your help.

Thank you.
srini
Re: [R] merging data frames
DF1 <- data.frame(Name=as.factor(c("A","B","C")), Age=c(21,45,30))
DF2 <- data.frame(Name=as.factor(c("A","B","X")), Age=c(50,20,10))
DF3 <- data.frame(Name=as.factor(c("B","Y","K")), Age=c(40,21,30))
merge(merge(DF1, DF2, by.x="Name", by.y="Name", all=TRUE), DF3, by.x="Name", by.y="Name", all=TRUE)

  Name Age.x Age.y Age
1    A    21    50  NA
2    B    45    20  40
3    C    30    NA  NA
4    X    NA    10  NA
5    K    NA    NA  30
6    Y    NA    NA  21

thanks
y

Srinivas Iyyer wrote:
> Could any one help me how can I merge 3 dataframes.

-----
Yasir H. Kaheil
Catchment Research Facility
The University of Western Ontario
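The nested merge() call above gets harder to write as the number of data frames grows, and the default suffixes give column names (Age.x, Age.y, Age) that differ from the Age.1, Age.2, Age.3 layout requested in the question. One possible sketch, assuming every data frame has a Name key column and an Age value column as in the example, renames the Age columns first and then folds merge() over a list with Reduce(). The names dfs and DF4 are illustrative only.

```r
DF1 <- data.frame(Name = c("A", "B", "C"), Age = c(21, 45, 30))
DF2 <- data.frame(Name = c("A", "B", "X"), Age = c(50, 20, 10))
DF3 <- data.frame(Name = c("B", "Y", "K"), Age = c(40, 21, 30))

dfs <- list(DF1, DF2, DF3)

# Rename each Age column to Age.1, Age.2, ... so that merge() never
# has to invent suffixes for clashing column names.
for (i in seq_along(dfs)) {
  names(dfs[[i]])[names(dfs[[i]]) == "Age"] <- paste0("Age.", i)
}

# Fold merge() over the list; all = TRUE keeps every Name that occurs
# in any of the data frames and fills the gaps with NA.
DF4 <- Reduce(function(x, y) merge(x, y, by = "Name", all = TRUE), dfs)
print(DF4)
```

The same pattern should extend to any number of data frames that share the key column, since Reduce() simply applies the two-frame merge repeatedly from left to right.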