Re: [R] Merging two data frames with 3 common variables makes duplicated rows

2009-05-09 Thread Rocko22

Thomas,

You are very clever! The meil2 data frame contains each combination of the
common variables twice:

> meil2
   dist sexe style     meil
1    38    F  clas 02:43:17
2    38    F  free 02:24:46
3    38    H  clas 02:37:36
4    38    H  free 01:59:35
5    45    F  clas 03:46:15
6    45    F  free 02:20:15
7    45    H  clas 02:30:07
8    45    H  free 01:59:36
9    38    F  clas 02:43:17
10   38    F  free 02:24:46
11   38    H  clas 02:37:36
12   38    H  free 01:59:35
13   45    F  clas 03:46:15
14   45    F  free 02:20:15
15   45    H  clas 02:30:07
16   45    H  free 01:59:36

After keeping only the unique combinations, meil2 merged correctly with the
other data frame. This merge() function is more subtle than I first thought.
That means that when merging two data frames, if the result has more rows
than one of them, the other one contains duplicate combinations of the
common variables.
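
As a minimal sketch of this behaviour, the toy data below (hypothetical
values; the names `runners`, `best`, and `keys` are made up for the
illustration) repeats every key combination twice in the second data frame,
so merge() returns two rows per runner. Calling duplicated() on the key
columns is one way to spot this before merging.

```r
# Toy data, not the race results from the thread.
runners <- data.frame(
  name  = c("A", "B", "C"),
  sexe  = c("F", "F", "H"),
  dist  = c(38L, 38L, 45L),
  style = c("clas", "clas", "free")
)
# Lookup table in which each key combination appears twice.
best <- data.frame(
  sexe  = c("F", "H", "F", "H"),
  dist  = c(38L, 45L, 38L, 45L),
  style = c("clas", "free", "clas", "free"),
  meil  = c("02:43:17", "01:59:36", "02:43:17", "01:59:36")
)
keys <- c("sexe", "dist", "style")

# Every row of 'runners' matches two rows of 'best'.
merged <- merge(runners, best, by = keys)
nrow(merged)    # 6 rows from 3 runners

# Flag key combinations that have already appeared further up in 'best'.
dup_keys <- duplicated(best[keys])
sum(dup_keys)   # 2 repeated combinations
```

If sum(dup_keys) is zero for the lookup table, a default merge() cannot
return more rows than the first data frame has.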

Thank you very much, I will try to be more careful about this.

Rock


Thomas Lumley wrote:
> 
> Lines 9 and 1 appear to be the same in meil2, as do 2 and 10.  If the 16
> rows consist of two repeats of 8 rows that would explain why you are
> getting two copies of each individual in the output. unique(meil2) would
> have just the distinct rows.
> 
>    -thomas

-- 
View this message in context: 
http://www.nabble.com/Merging-two-data-frames-with-3-common-variables-makes-duplicated-rows-tp23454018p23459790.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Merging two data frames with 3 common variables makes duplicated rows

2009-05-08 Thread Rock Ouimet
I am new to R (ex-SAS user), and I cannot merge two data frames without
getting duplicated rows in the result. How can I avoid this without using
the unique() function?

1. First data frame is called tmv with 6 variables and 239 rows:

> tmv[1:10,]
      temps       nom        prenom sexe dist style
1  01:59:36       Cyr         Steve    H   45  free
2  02:09:55  Gosselin         Erick    H   45  free
3  02:12:18 Desfosses         Sacha    H   45  free
4  02:12:23  Lapointe     Sebastien    H   45  free
5  02:12:52    Labrie        Michel    H   45  free
6  02:12:54   Leblanc        Michel    H   45  free
7  02:13:02 Thibeault       Sylvain    H   45  free
8  02:13:49    Martel      Stephane    H   45  free
9  02:14:03    Lavoie Jean-Philippe    H   45  free
10 02:14:05    Boivin   Jean-Claude    H   45  free

Its structure is:
> str(tmv)
'data.frame':   239 obs. of  6 variables:
 $ temps :Class 'times'  atomic [1:239] 0.0831 0.0902 0.0919 0.0919 0.0923 ...
  .. ..- attr(*, "format")= chr "h:m:s"
 $ nom   : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55 105 98 110 158 117 109 22 ...
 $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128 33 121 122 93 93 130 126 63 59 ...
 $ sexe  : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2 ...
 $ dist  : int  45 45 45 45 45 45 45 45 45 45 ...
 $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2 2 2 ...


2. The second data frame is called meil2 with 4 variables and 16 rows:

> meil2[1:10,]
   dist sexe style     meil
1    38    F  clas 02:43:17
2    38    F  free 02:24:46
3    38    H  clas 02:37:36
4    38    H  free 01:59:35
5    45    F  clas 03:46:15
6    45    F  free 02:20:15
7    45    H  clas 02:30:07
8    45    H  free 01:59:36
9    38    F  clas 02:43:17
10   38    F  free 02:24:46


Note that the two data frames have sexe, dist, and style as common
variables, and of the same class (Factor) and number of levels.
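
One way to check this kind of claim is sketched below. It is an
illustration on two small hypothetical data frames (`x` and `y` are made-up
names, not tmv and meil2): intersect() on the column names gives the columns
that merge() uses when `by` is not supplied, and sapply(..., class) shows
whether those columns have the same class on both sides.

```r
# Two small hypothetical data frames sharing three columns.
x <- data.frame(
  sexe  = factor(c("F", "H")),
  dist  = c(38L, 45L),
  style = factor(c("clas", "free")),
  temps = c("02:49:15", "01:59:36")
)
y <- data.frame(
  dist  = c(38L, 45L),
  sexe  = factor(c("F", "H")),
  style = factor(c("clas", "free")),
  meil  = c("02:43:17", "01:59:36")
)

# Columns present in both data frames; this is merge()'s default 'by'.
common <- intersect(names(x), names(y))
common

# Class of each common column on each side.
class_x <- sapply(x[common], class)
class_y <- sapply(y[common], class)
class_x
identical(class_x, class_y)
```

In this toy example dist is an integer column rather than a factor, as it
also is in the str(tmv) output above.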

When merging the two data frames into tmv3, the merge itself works, but
every row is duplicated:

> tmv3 <- merge(tmv, meil2, sort=TRUE, by=c("sexe", "dist", "style"))
> tmv3[1:10,]
   sexe dist style    temps        nom       prenom     meil
1     F   38  clas 02:49:15    Boucher Marie-Amelie 02:43:17
2     F   38  clas 02:49:15    Boucher Marie-Amelie 02:43:17
3     F   38  clas 03:24:05     Vachon     Guylaine 02:43:17
4     F   38  clas 03:24:05     Vachon     Guylaine 02:43:17
5     F   38  clas 03:13:11 Villeneuve       Rejean 02:43:17
6     F   38  clas 03:13:11 Villeneuve       Rejean 02:43:17
7     F   38  clas 03:37:54    Stevens        Julie 02:43:17
8     F   38  clas 03:37:54    Stevens        Julie 02:43:17
9     F   38  clas 03:53:03       Cote       Marthe 02:43:17
10    F   38  clas 03:53:03       Cote       Marthe 02:43:17

Can anyone explain this behavior of R?

$version.string
[1] "R version 2.8.1 (2008-12-22)"

Rock




Re: [R] Merging two data frames with 3 common variables makes duplicated rows

2009-05-08 Thread Thomas Lumley

On Fri, 8 May 2009, Rock Ouimet wrote:


> I am new to R (ex SAS user), and I cannot merge two data frames without
> getting duplicated rows in the results. How to avoid this happening without
> using the unique() function?
>
> 1. First data frame is called tmv with 6 variables and 239 rows:
>
> > tmv[1:10,]
>       temps       nom        prenom sexe dist style
> 1  01:59:36       Cyr         Steve    H   45  free
> 2  02:09:55  Gosselin         Erick    H   45  free
> 3  02:12:18 Desfosses         Sacha    H   45  free
> 4  02:12:23  Lapointe     Sebastien    H   45  free
> 5  02:12:52    Labrie        Michel    H   45  free
> 6  02:12:54   Leblanc        Michel    H   45  free
> 7  02:13:02 Thibeault       Sylvain    H   45  free
> 8  02:13:49    Martel      Stephane    H   45  free
> 9  02:14:03    Lavoie Jean-Philippe    H   45  free
> 10 02:14:05    Boivin   Jean-Claude    H   45  free
>
> Its structure is:
>
> > str(tmv)
> 'data.frame':   239 obs. of  6 variables:
>  $ temps :Class 'times'  atomic [1:239] 0.0831 0.0902 0.0919 0.0919 0.0923 ...
>   .. ..- attr(*, "format")= chr "h:m:s"
>  $ nom   : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55 105 98 110 158 117 109 22 ...
>  $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128 33 121 122 93 93 130 126 63 59 ...
>  $ sexe  : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2 ...
>  $ dist  : int  45 45 45 45 45 45 45 45 45 45 ...
>  $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2 2 2 ...
>
> 2. The second data frame is called meil2 with 4 variables and 16 rows;
>
> > meil2[1:10,]
>    dist sexe style     meil
> 1    38    F  clas 02:43:17
> 2    38    F  free 02:24:46
> 3    38    H  clas 02:37:36
> 4    38    H  free 01:59:35
> 5    45    F  clas 03:46:15
> 6    45    F  free 02:20:15
> 7    45    H  clas 02:30:07
> 8    45    H  free 01:59:36
> 9    38    F  clas 02:43:17
> 10   38    F  free 02:24:46



Lines 9 and 1 appear to be the same in meil2, as do 2 and 10.  If the 16 rows 
consist of two repeats of 8 rows that would explain why you are getting two 
copies of each individual in the output. unique(meil2) would have just the 
distinct rows.

 -thomas

Thomas Lumley             Assoc. Professor, Biostatistics
tlum...@u.washington.edu  University of Washington, Seattle
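
The suggestion above can be sketched as follows. This is an illustration
only: `lookup` and `results` are made-up names holding a handful of rows in
the shape of meil2 and tmv, not the full data from the thread. unique()
drops rows that are identical in every column; if the repeated rows could
differ outside the key columns, filtering with !duplicated() on the keys is
an alternative, and it keeps the first row for each key.

```r
# A lookup table in the shape of meil2, with each row repeated once.
lookup <- data.frame(
  dist  = c(38L, 45L, 38L, 45L),
  sexe  = c("F", "H", "F", "H"),
  style = c("clas", "free", "clas", "free"),
  meil  = c("02:43:17", "01:59:36", "02:43:17", "01:59:36")
)
# A few result rows in the shape of tmv.
results <- data.frame(
  temps = c("02:49:15", "03:24:05", "02:12:18"),
  sexe  = c("F", "F", "H"),
  dist  = c(38L, 38L, 45L),
  style = c("clas", "clas", "free")
)
keys <- c("sexe", "dist", "style")

# Drop rows that are exact repeats of an earlier row.
lookup_unique <- unique(lookup)

# Alternative: keep the first row for each key combination only.
lookup_keyed <- lookup[!duplicated(lookup[keys]), ]

# With one lookup row per key, each result row is matched exactly once.
merged <- merge(results, lookup_unique, by = keys)
nrow(merged) == nrow(results)
```

Comparing nrow() before and after the merge, as in the last line, is a cheap
check that no rows were multiplied.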
