On Nov 13, 2011, at 1:19 AM, Joshua Wiley wrote:

Hi Carlos,

Here is one option:

## read in your data
dat <- read.table(textConnection("
obs     unit            home       z    sex     age
1       015029  18             1        1       053
2       015029  18             1        2       049
3       015029  01             1        1       038
4       015029  01             1        2       033
5       015029  02             1        1       036
6       015029  02             1        2       033
7       015029  03             1        1       023
8       015029  03             1        2       019
9       015029  04             1        2       045
10      015029  05             1        2       047"),
 header = TRUE, stringsAsFactors = FALSE)
closeAllConnections()

## create a unique ID for matching unit and home

I would have used the 'interaction' function. And I would have read the data in with colClasses to preserve the labeling of the household information:

> dat <- read.table(textConnection("
+ obs     unit            home       z    sex     age
+ 1       015029  18             1        1       053
+ 2       015029  18             1        2       049
+ 3       015029  01             1        1       038
+ 4       015029  01             1        2       033
+ 5       015029  02             1        1       036
+ 6       015029  02             1        2       033
+ 7       015029  03             1        1       023
+ 8       015029  03             1        2       019
+ 9       015029  04             1        2       045
+ 10      015029  05             1        2       047"),
+ header = TRUE, stringsAsFactors = FALSE, colClasses=c(rep("factor", 4), "numeric"))
> closeAllConnections()
>
> ## create a unique ID for matching unit and home
> dat
   obs   unit home z sex age
1    1 015029   18 1   1 053
2    2 015029   18 1   2 049
3    3 015029   01 1   1 038
4    4 015029   01 1   2 033
5    5 015029   02 1   1 036
6    6 015029   02 1   2 033
7    7 015029   03 1   1 023
8    8 015029   03 1   2 019
9    9 015029   04 1   2 045
10  10 015029   05 1   2 047
> dat$together <- with(dat, interaction(unit, home) )
> dat$togeth.n <- as.numeric(dat$together)
> dat
   obs   unit home z sex age  together togeth.n
1    1 015029   18 1   1 053 015029.18        6
2    2 015029   18 1   2 049 015029.18        6
3    3 015029   01 1   1 038 015029.01        1
4    4 015029   01 1   2 033 015029.01        1
5    5 015029   02 1   1 036 015029.02        2
6    6 015029   02 1   2 033 015029.02        2
7    7 015029   03 1   1 023 015029.03        3
8    8 015029   03 1   2 019 015029.03        3
9    9 015029   04 1   2 045 015029.04        4
10  10 015029   05 1   2 047 015029.05        5

I'm assuming that the numbering of the unit/household pairings is somewhat arbitrary. The I would set to missing all of the non-couple households:

> is.na(dat$togett.n) <- !as.logical( ave(dat$sex, dat$together, FUN=function(x) 1 %in% x & 2 %in% x) )
> dat
   obs   unit home z sex age  together togett.n
1    1 015029   18 1   1 053 015029.18        6
2    2 015029   18 1   2 049 015029.18        6
3    3 015029   01 1   1 038 015029.01        1
4    4 015029   01 1   2 033 015029.01        1
5    5 015029   02 1   1 036 015029.02        2
6    6 015029   02 1   2 033 015029.02        2
7    7 015029   03 1   1 023 015029.03        3
8    8 015029   03 1   2 019 015029.03        3
9    9 015029   04 1   2 045 015029.04       NA
10  10 015029   05 1   2 047 015029.05       NA

I actually think it would be better to use the fully labeled "together" variable rather than the numeric version. It retains its data heritage better.

--
David.


dat$mID <- with(dat, paste(unit, home, sep = ''))

## somewhat messy way of creating a couple number
## for each mID, if there is more than 1 row, and more than 1 sex
## it creates a couple id, otherwise 0
i <- 0L
dat$couple <- with(dat, unlist(lapply(split(sex, mID), function(x) {
 i <<- i + 1L
 if (length(x) > 1 && length(unique(x)) > 1) {
   rep(i, length(x))
 } else 0L
})))

## view results
dat
  obs  unit home z sex age     mID couple
1    1 15029   18 1   1  53 1502918      1
2    2 15029   18 1   2  49 1502918      1
3    3 15029    1 1   1  38  150291      2
4    4 15029    1 1   2  33  150291      2
5    5 15029    2 1   1  36  150292      3
6    6 15029    2 1   2  33  150292      3
7    7 15029    3 1   1  23  150293      4
8    8 15029    3 1   2  19  150293      4
9    9 15029    4 1   2  45  150294      0
10  10 15029    5 1   2  47  150295      0

See these functions for more details:

?ave # where I got my idea
?split
?lapply
?`<<-`

Cheers,

Josh

On Sat, Nov 12, 2011 at 8:16 PM, jour4life <jour4l...@gmail.com> wrote:
Hi all,

I've searched everywhere to try to find out how to do this and have had no
luck. I am trying to construct identifiers for couples in a dataset.
Essentially, I want to identify couples using more than one column as
identifiers. Take for instance:

obs     unit            home       z    sex     age
1       015029  18             1        1       053
2       015029  18             1        2       049
3       015029  01             1        1       038
4       015029  01             1        2       033
5       015029  02             1        1       036
6       015029  02             1        2       033
7       015029  03             1        1       023
8       015029  03             1        2       019
9       015029  04             1        2       045
10      015029  05             1        2       047

Where unit is the housing unit, home is household. Of course, there are more values for unit, although these first ten observations consist of the same unit (which could possibly be an apartment complex). Nonetheless, I want to construct an identifier for couples if unit, home match, but only if both male and female are within the same household. Taking the example data
above, I want to see this:

       unit            home    z       sex     age      couple
1       015029  18             1        1       053      1
2       015029  18             1        2       049      1
3       015029  01             1        1       038      2
4       015029  01             1        2       033      2
5       015029  02             1        1       036      3
6       015029  02             1        2       033      3
7       015029  03             1        1       023      4
8       015029  03             1        2       019      4
9       015029  04             1        2       045      0
10      015029  05             1        2       047      0

As you can see in the last two observations, there were no males identified within the same household, thus the last two observations would not contain couple identifiers, rather some other identifier (but the same one) so I can detect them and remove them later. I've tried using the duplicated function
but was not very useful.

Any help would be greatly appreciated!!!

Thanks,

Carlos

--
View this message in context: 
http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to