Re: [R] identify duplicate from more than one column

David Winsemius Sun, 13 Nov 2011 04:53:27 -0800


On Nov 13, 2011, at 1:19 AM, Joshua Wiley wrote:

Hi Carlos,

Here is one option:

## read in your data
dat <- read.table(textConnection("
obs     unit            home       z    sex     age
1       015029  18             1        1       053
2       015029  18             1        2       049
3       015029  01             1        1       038
4       015029  01             1        2       033
5       015029  02             1        1       036
6       015029  02             1        2       033
7       015029  03             1        1       023
8       015029  03             1        2       019
9       015029  04             1        2       045
10      015029  05             1        2       047"),
 header = TRUE, stringsAsFactors = FALSE)
closeAllConnections()

## create a unique ID for matching unit and home

I would have used the 'interaction' function. And I would have read the data in with colClasses to preserve the labeling of the household information:


> dat <- read.table(textConnection("
+ obs     unit            home       z    sex     age
+ 1       015029  18             1        1       053
+ 2       015029  18             1        2       049
+ 3       015029  01             1        1       038
+ 4       015029  01             1        2       033
+ 5       015029  02             1        1       036
+ 6       015029  02             1        2       033
+ 7       015029  03             1        1       023
+ 8       015029  03             1        2       019
+ 9       015029  04             1        2       045
+ 10      015029  05             1        2       047"),
+ header = TRUE, stringsAsFactors = FALSE, colClasses=c(rep("factor",4), "numeric"))
> closeAllConnections()
>
> ## create a unique ID for matching unit and home
> dat
   obs   unit home z sex age
1    1 015029   18 1   1 053
2    2 015029   18 1   2 049
3    3 015029   01 1   1 038
4    4 015029   01 1   2 033
5    5 015029   02 1   1 036
6    6 015029   02 1   2 033
7    7 015029   03 1   1 023
8    8 015029   03 1   2 019
9    9 015029   04 1   2 045
10  10 015029   05 1   2 047
> dat$together <- with(dat, interaction(unit, home) )
> dat$togeth.n <- as.numeric(dat$together)
> dat
   obs   unit home z sex age  together togeth.n
1    1 015029   18 1   1 053 015029.18        6
2    2 015029   18 1   2 049 015029.18        6
3    3 015029   01 1   1 038 015029.01        1
4    4 015029   01 1   2 033 015029.01        1
5    5 015029   02 1   1 036 015029.02        2
6    6 015029   02 1   2 033 015029.02        2
7    7 015029   03 1   1 023 015029.03        3
8    8 015029   03 1   2 019 015029.03        3
9    9 015029   04 1   2 045 015029.04        4
10  10 015029   05 1   2 047 015029.05        5
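
One caveat on the step above: interaction() creates a level for every combination of its inputs, so once more than one unit is present many levels can be empty and the numeric codes will have gaps. A minimal sketch of how drop = TRUE keeps only the observed combinations, assuming a small made-up data frame `toy` (not the poster's data):

```r
# Hypothetical toy data: two units whose home numbers only partly overlap
toy <- data.frame(
  unit = factor(c("015029", "015029", "015030", "015030")),
  home = factor(c("01", "02", "02", "03"))
)

# Default: one level per unit x home combination, observed or not
all_lvls <- with(toy, interaction(unit, home))
nlevels(all_lvls)    # 6: 2 units * 3 homes

# drop = TRUE keeps only the combinations that occur in the data
used_lvls <- with(toy, interaction(unit, home, drop = TRUE))
nlevels(used_lvls)   # 4

# The integer codes index the levels, so they follow level order,
# not the order in which households appear in the data
as.integer(used_lvls)
```

Either way the codes are arbitrary labels; they identify a household but carry no ordering that means anything.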

I'm assuming that the numbering of the unit/household pairings is somewhat arbitrary. Then I would set to missing all of the non-couple households:

> is.na(dat$togeth.n) <- !as.logical( ave(dat$sex, dat$together, FUN=function(x) 1 %in% x & 2 %in% x) )
> dat
   obs   unit home z sex age  together togeth.n
1    1 015029   18 1   1 053 015029.18        6
2    2 015029   18 1   2 049 015029.18        6
3    3 015029   01 1   1 038 015029.01        1
4    4 015029   01 1   2 033 015029.01        1
5    5 015029   02 1   1 036 015029.02        2
6    6 015029   02 1   2 033 015029.02        2
7    7 015029   03 1   1 023 015029.03        3
8    8 015029   03 1   2 019 015029.03        3
9    9 015029   04 1   2 045 015029.04       NA
10  10 015029   05 1   2 047 015029.05       NA

I actually think it would be better to use the fully labeled "together" variable rather than the numeric version. It retains its data heritage better.
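
A minimal sketch of that labeled variant, assuming the same ten rows as in the thread and a new column name `couple` chosen here for illustration. Everything is read as character so the leading zeros survive, and only `sex` is converted to integer:

```r
# The ten rows from the thread; colClasses = "character" keeps leading zeros
dat <- read.table(text = "
obs unit home z sex age
1 015029 18 1 1 053
2 015029 18 1 2 049
3 015029 01 1 1 038
4 015029 01 1 2 033
5 015029 02 1 1 036
6 015029 02 1 2 033
7 015029 03 1 1 023
8 015029 03 1 2 019
9 015029 04 1 2 045
10 015029 05 1 2 047
", header = TRUE, colClasses = "character")
dat$sex <- as.integer(dat$sex)

# Labeled household key; drop = TRUE discards unobserved combinations
dat$together <- interaction(dat$unit, dat$home, drop = TRUE)

# TRUE for every row of a household that contains both sex codes
both <- ave(dat$sex, dat$together,
            FUN = function(x) all(c(1L, 2L) %in% x)) == 1

# Keep the household label for couples, NA for everyone else
dat$couple <- factor(ifelse(both, as.character(dat$together), NA))
```

Rows 9 and 10 end up NA in `couple`, and they can be dropped later with `dat[!is.na(dat$couple), ]`.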


--
David.

dat$mID <- with(dat, paste(unit, home, sep = ''))

## somewhat messy way of creating a couple number
## for each mID, if there is more than 1 row, and more than 1 sex
## it creates a couple id, otherwise 0
i <- 0L
dat$couple <- with(dat, unlist(lapply(split(sex, mID), function(x) {
 i <<- i + 1L
 if (length(x) > 1 && length(unique(x)) > 1) {
   rep(i, length(x))
 } else 0L
})))

## view results
dat
   obs  unit home z sex age     mID couple
1    1 15029   18 1   1  53 1502918      1
2    2 15029   18 1   2  49 1502918      1
3    3 15029    1 1   1  38  150291      2
4    4 15029    1 1   2  33  150291      2
5    5 15029    2 1   1  36  150292      3
6    6 15029    2 1   2  33  150292      3
7    7 15029    3 1   1  23  150293      4
8    8 15029    3 1   2  19  150293      4
9    9 15029    4 1   2  45  150294      0
10  10 15029    5 1   2  47  150295      0
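
A note on the split()/lapply() approach above: the results come back in sorted-group order and are assigned to the rows by position, so they line up with the right rows here only because every household occupies adjacent rows of equal size. Pasting with sep = '' can also merge distinct unit/home pairs into the same ID. A minimal order-independent sketch, assuming a made-up, deliberately shuffled data frame and the hypothetical names `key`, `is_couple` and `couple_keys`:

```r
# Hypothetical shuffled data: household rows are not adjacent
dat <- data.frame(
  unit = c("015029", "015029", "015029", "015029", "015029"),
  home = c("04", "18", "01", "18", "01"),
  sex  = c(2L, 1L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# A separator keeps e.g. unit "1502"/home "918" apart from unit "15029"/home "18"
key <- paste(dat$unit, dat$home, sep = ".")

# ave() writes each group's result back to that group's own rows
is_couple <- ave(dat$sex, key,
                 FUN = function(x) all(c(1L, 2L) %in% x)) == 1

# Number couple households in order of first appearance; 0 marks non-couples
couple_keys <- unique(key[is_couple])
dat$couple <- ifelse(is_couple, match(key, couple_keys), 0L)

dat
```

This keeps the 0 code for non-couple rows that the original question asked for, without the global counter and `<<-`.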

See these functions for more details:

?ave # where I got my idea
?split
?lapply
?`<<-`

Cheers,

Josh

On Sat, Nov 12, 2011 at 8:16 PM, jour4life <jour4l...@gmail.com> wrote:

Hi all,

I've searched everywhere to try to find out how to do this and have had no
luck. I am trying to construct identifiers for couples in a dataset.
Essentially, I want to identify couples using more than one column as
identifiers. Take for instance:

obs     unit            home       z    sex     age
1       015029  18             1        1       053
2       015029  18             1        2       049
3       015029  01             1        1       038
4       015029  01             1        2       033
5       015029  02             1        1       036
6       015029  02             1        2       033
7       015029  03             1        1       023
8       015029  03             1        2       019
9       015029  04             1        2       045
10      015029  05             1        2       047

Where unit is the housing unit and home is the household. Of course, there are
more values for unit, although these first ten observations consist of the
same unit (which could possibly be an apartment complex). Nonetheless, I want
to construct an identifier for couples if unit and home match, but only if
both male and female are within the same household. Taking the example data
above, I want to see this:

       unit            home    z       sex     age      couple
1       015029  18             1        1       053      1
2       015029  18             1        2       049      1
3       015029  01             1        1       038      2
4       015029  01             1        2       033      2
5       015029  02             1        1       036      3
6       015029  02             1        2       033      3
7       015029  03             1        1       023      4
8       015029  03             1        2       019      4
9       015029  04             1        2       045      0
10      015029  05             1        2       047      0

As you can see in the last two observations, there were no males identified
within the same household, thus the last two observations would not contain
couple identifiers, rather some other identifier (but the same one) so I can
detect them and remove them later. I've tried using the duplicated function
but it was not very useful.
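
For context, a minimal sketch of why duplicated() alone falls short here, assuming a three-row made-up data frame `hh`: it flags only the later rows of a repeated unit/home pair, and it knows nothing about whether both sexes are present.

```r
# Hypothetical mini example: one two-person household, one single-person household
hh <- data.frame(
  unit = c("015029", "015029", "015029"),
  home = c("18", "18", "04"),
  sex  = c(1L, 2L, 2L),
  stringsAsFactors = FALSE
)
keys <- hh[c("unit", "home")]

# Only the second row of the repeated pair is flagged
later_rows <- duplicated(keys)            # FALSE TRUE FALSE

# Combining both directions flags every row of a repeated pair ...
any_repeat <- later_rows | duplicated(keys, fromLast = TRUE)   # TRUE TRUE FALSE

# ... but a same-sex pair would be flagged as well, so a per-group
# check on sex is still needed
```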

Any help would be greatly appreciated!!!

Thanks,

Carlos

--
View this message in context: 
http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/



David Winsemius, MD
West Hartford, CT
