Re: [R] Merging two data frames with 3 common variables makes duplicated rows
Thomas,

You are very clever! The meil2 data frame contains each combination of the common variables twice:

meil2
   dist sexe style     meil
1    38    F  clas 02:43:17
2    38    F  free 02:24:46
3    38    H  clas 02:37:36
4    38    H  free 01:59:35
5    45    F  clas 03:46:15
6    45    F  free 02:20:15
7    45    H  clas 02:30:07
8    45    H  free 01:59:36
9    38    F  clas 02:43:17
10   38    F  free 02:24:46
11   38    H  clas 02:37:36
12   38    H  free 01:59:35
13   45    F  clas 03:46:15
14   45    F  free 02:20:15
15   45    H  clas 02:30:07
16   45    H  free 01:59:36

After keeping only the unique combinations, the merge with the other data frame worked correctly. This merge() function is more subtle than I first thought. So when merging two data frames, if the result has more rows than either of the original data frames, it means that there are duplicate combinations of the common variables in one or both of the data frames.

Thank you very much, I will try to be more careful about this.

Rock

Thomas Lumley wrote:
> On Fri, 8 May 2009, Rock Ouimet wrote:
>> I am new to R (ex SAS user), and I cannot merge two data frames without
>> getting duplicated rows in the results. How to avoid this happening
>> without using the unique() function?
>
> Lines 9 and 1 appear to be the same in meil2, as do 2 and 10. If the 16
> rows consist of two repeats of 8 rows that would explain why you are
> getting two copies of each individual in the output.
>
> unique(meil2) would have just the distinct rows.
>
> -thomas
>
> Thomas Lumley
> Assoc. Professor, Biostatistics
> tlum...@u.washington.edu
> University of Washington, Seattle

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
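The row multiplication described in this message can be reproduced on a small scale. The following is only a sketch under stated assumptions: tmv_small and meil_dup are hypothetical cut-down stand-ins for tmv and meil2, and the times are stored as plain character strings rather than the 'times' class used in the thread. It counts the duplicated key combinations in each input and then shows the effect on the merged row count.

```r
# Toy data modeled on the thread; times are plain character strings here
# (an assumption: the original columns use the 'times' class).
keys <- c("sexe", "dist", "style")

# Hypothetical three-runner subset of tmv.
tmv_small <- data.frame(
  temps = c("01:59:36", "02:09:55", "02:49:15"),
  nom   = c("Cyr", "Gosselin", "Boucher"),
  sexe  = c("H", "H", "F"),
  dist  = c(45L, 45L, 38L),
  style = c("free", "free", "clas")
)

# Lookup table with one row per key combination ...
meil_once <- data.frame(
  dist  = c(38L, 45L),
  sexe  = c("F", "H"),
  style = c("clas", "free"),
  meil  = c("02:43:17", "01:59:36")
)
# ... and the same table stacked twice, which mimics the state of meil2.
meil_dup <- rbind(meil_once, meil_once)

# Number of rows whose key combination already appeared in an earlier row.
dup_tmv  <- sum(duplicated(tmv_small[, keys]))   # runners sharing a category
dup_meil <- sum(duplicated(meil_dup[, keys]))    # repeated lookup rows

merged <- merge(tmv_small, meil_dup, by = keys)

# Every runner matches two lookup rows, so the row count doubles.
nrow(tmv_small)
nrow(merged)
```

Repeated keys in tmv are expected here, since many runners share a category; it is the repeated keys in the lookup table that multiply the output rows.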
[R] Merging two data frames with 3 common variables makes duplicated rows
I am new to R (ex SAS user), and I cannot merge two data frames without getting duplicated rows in the results. How to avoid this happening without using the unique() function?

1. First data frame is called tmv with 6 variables and 239 rows:

tmv[1:10,]
      temps       nom        prenom sexe dist style
1  01:59:36       Cyr         Steve    H   45  free
2  02:09:55  Gosselin         Erick    H   45  free
3  02:12:18 Desfosses         Sacha    H   45  free
4  02:12:23  Lapointe     Sebastien    H   45  free
5  02:12:52    Labrie        Michel    H   45  free
6  02:12:54   Leblanc        Michel    H   45  free
7  02:13:02 Thibeault       Sylvain    H   45  free
8  02:13:49    Martel      Stephane    H   45  free
9  02:14:03    Lavoie Jean-Philippe    H   45  free
10 02:14:05    Boivin   Jean-Claude    H   45  free

Its structure is:

str(tmv)
'data.frame':   239 obs. of  6 variables:
 $ temps :Class 'times'  atomic [1:239] 0.0831 0.0902 0.0919 0.0919 0.0923 ...
  .. ..- attr(*, "format")= chr "h:m:s"
 $ nom   : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55 105 98 110 158 117 109 22 ...
 $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128 33 121 122 93 93 130 126 63 59 ...
 $ sexe  : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2 ...
 $ dist  : int  45 45 45 45 45 45 45 45 45 45 ...
 $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2 2 2 ...

2. The second data frame is called meil2 with 4 variables and 16 rows:

meil2[1:10,]
   dist sexe style     meil
1    38    F  clas 02:43:17
2    38    F  free 02:24:46
3    38    H  clas 02:37:36
4    38    H  free 01:59:35
5    45    F  clas 03:46:15
6    45    F  free 02:20:15
7    45    H  clas 02:30:07
8    45    H  free 01:59:36
9    38    F  clas 02:43:17
10   38    F  free 02:24:46

Its structure is:

str(tmv)
'data.frame':   239 obs. of  6 variables:
 $ temps :Class 'times'  atomic [1:239] 0.0831 0.0902 0.0919 0.0919 0.0923 ...
  .. ..- attr(*, "format")= chr "h:m:s"
 $ nom   : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55 105 98 110 158 117 109 22 ...
 $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128 33 121 122 93 93 130 126 63 59 ...
 $ sexe  : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2 ...
 $ dist  : int  45 45 45 45 45 45 45 45 45 45 ...
 $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2 2 2 ...
Note that the two data frames have sexe, dist, and style as common variables, with the same class (Factor) and number of levels. When merging the two data frames into tmv3, the merging is fine, but all the rows get duplicated:

tmv3 <- merge(tmv, meil2, sort=TRUE, by=c("sexe", "dist", "style"))
tmv3[1:10,]
   sexe dist style    temps        nom       prenom     meil
1     F   38  clas 02:49:15    Boucher Marie-Amelie 02:43:17
2     F   38  clas 02:49:15    Boucher Marie-Amelie 02:43:17
3     F   38  clas 03:24:05     Vachon     Guylaine 02:43:17
4     F   38  clas 03:24:05     Vachon     Guylaine 02:43:17
5     F   38  clas 03:13:11 Villeneuve       Rejean 02:43:17
6     F   38  clas 03:13:11 Villeneuve       Rejean 02:43:17
7     F   38  clas 03:37:54    Stevens        Julie 02:43:17
8     F   38  clas 03:37:54    Stevens        Julie 02:43:17
9     F   38  clas 03:53:03       Cote       Marthe 02:43:17
10    F   38  clas 03:53:03       Cote       Marthe 02:43:17

Can anyone explain this behavior from R?

$version.string
[1] "R version 2.8.1 (2008-12-22)"

Rock
Re: [R] Merging two data frames with 3 common variables makes duplicated rows
On Fri, 8 May 2009, Rock Ouimet wrote:

> I am new to R (ex SAS user), and I cannot merge two data frames without
> getting duplicated rows in the results. How to avoid this happening
> without using the unique() function?
>
> 2. The second data frame is called meil2 with 4 variables and 16 rows:
>
> meil2[1:10,]
>    dist sexe style     meil
> 1    38    F  clas 02:43:17
> 2    38    F  free 02:24:46
> 3    38    H  clas 02:37:36
> 4    38    H  free 01:59:35
> 5    45    F  clas 03:46:15
> 6    45    F  free 02:20:15
> 7    45    H  clas 02:30:07
> 8    45    H  free 01:59:36
> 9    38    F  clas 02:43:17
> 10   38    F  free 02:24:46

Lines 9 and 1 appear to be the same in meil2, as do 2 and 10. If the 16 rows consist of two repeats of 8 rows, that would explain why you are getting two copies of each individual in the output.

unique(meil2) would have just the distinct rows.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlum...@u.washington.edu
University of Washington, Seattle
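The unique(meil2) suggestion above can be sketched as follows. This is an illustration only, not the poster's actual session: tmv_small and meil2_small are hypothetical cut-down stand-ins for tmv and meil2, and the times are stored as character strings instead of the 'times' class. The guard before the merge is an added assumption about good practice, not something stated in the thread.

```r
keys <- c("sexe", "dist", "style")

# Hypothetical three-runner subset of tmv (times kept as character strings).
tmv_small <- data.frame(
  temps = c("01:59:36", "02:09:55", "02:49:15"),
  nom   = c("Cyr", "Gosselin", "Boucher"),
  sexe  = c("H", "H", "F"),
  dist  = c(45L, 45L, 38L),
  style = c("free", "free", "clas")
)

# Two rows of meil2 from the thread, each present twice to mimic the duplication.
meil2_small <- data.frame(
  dist  = c(38L, 45L, 38L, 45L),
  sexe  = c("F", "H", "F", "H"),
  style = c("clas", "free", "clas", "free"),
  meil  = c("02:43:17", "01:59:36", "02:43:17", "01:59:36")
)

# Drop rows that are exact repeats of an earlier row.
meil_unique <- unique(meil2_small)

# Guard: the lookup table should now have one row per key combination.
stopifnot(anyDuplicated(meil_unique[, keys]) == 0)

tmv3 <- merge(tmv_small, meil_unique, by = keys, sort = TRUE)

# One output row per runner, so nothing was duplicated by the merge.
stopifnot(nrow(tmv3) == nrow(tmv_small))
```

Note that unique() compares whole rows, so it only removes repeats whose meil value is also identical. If two rows shared a key combination but differed in meil, something like meil2[!duplicated(meil2[, keys]), ] would keep the first row per combination instead, and the guard above would catch the case where unique() alone was not enough.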