Re: [R] Choose between duplicated rows

jim holtman Sat, 14 Apr 2012 13:10:16 -0700

try this:

> x  # print data
   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
3 id1 11907 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0
5 id2 11829  2 NA NA  N  0       2
6 id2 11829 NA NA NA  N  0       3
> # select best data: within each value of A, keep the row with the fewest NAs
> xBest <- do.call(rbind, lapply(split(x, x$A), function(.grp){
+     best <- which.min(apply(.grp, 1, function(a) sum(is.na(a))))
+     .grp[best, ]
+ }))
> xBest
       id     A v1 v2 v3 v4 v5 numMiss
11829 id2 11829  1  2  1  Y  1       0
11905 id1 11905 NA NA NA  N  0       3
11907 id1 11907  3  2  1  Y  0       0
> # select worst data: within each value of A, keep the row with the most NAs
> xWorst <- do.call(rbind, lapply(split(x, x$A), function(.grp){
+     worst <- which.max(apply(.grp, 1, function(a) sum(is.na(a))))
+     .grp[worst, ]
+ }))
> xWorst
       id     A v1 v2 v3 v4 v5 numMiss
11829 id2 11829 NA NA NA  N  0       3
11905 id1 11905 NA NA NA  N  0       3
11907 id1 11907 NA NA NA  N  0       3
>
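
The transcript above covers the first step and groups on A alone. A minimal sketch of both steps asked for in the question might look like the following. It assumes that the existing numMiss column is accurate and that "same value of A by id" means grouping on id and A together. The names step1 and step2 are illustrative and do not come from the thread.

```r
# Data frame from the original question; stringsAsFactors = FALSE keeps
# id and v4 as character regardless of the R version.
x <- data.frame(
  id = c("id1", "id1", "id1", "id2", "id2", "id2"),
  A = c(11905, 11907, 11907, 11829, 11829, 11829),
  v1 = c(NA, 3, NA, 1, 2, NA),
  v2 = c(NA, 2, NA, 2, NA, NA),
  v3 = c(NA, 1, NA, 1, NA, NA),
  v4 = c("N", "Y", "N", "Y", "N", "N"),
  v5 = c(0, 0, 0, 1, 0, 0),
  numMiss = c(3, 0, 3, 0, 2, 3),
  stringsAsFactors = FALSE
)

# Step 1: within each (id, A) combination that actually occurs
# (drop = TRUE discards empty combinations), keep the row with the
# smallest numMiss. which.min() returns the first row on ties.
step1 <- do.call(rbind, lapply(split(x, list(x$id, x$A), drop = TRUE),
                               function(.grp) {
  .grp[which.min(.grp$numMiss), ]
}))

# Step 2: within each id, keep the remaining row with the smallest A.
step2 <- do.call(rbind, lapply(split(step1, step1$id), function(.grp) {
  .grp[which.min(.grp$A), ]
}))
rownames(step2) <- NULL  # drop the group labels that rbind() adds

step2
```

On the example data this leaves the rows for id1 with A = 11905 and id2 with A = 11829, which is the result requested in the question. If ties within a group should be broken some other way than "first row wins", the which.min() calls are the place to change.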



On Sat, Apr 14, 2012 at 3:03 PM, francy <francy.casal...@gmail.com> wrote:

> Dear r experts,
>
> Sorry for this basic question, but I can't seem to find a solution.
>
> I have this data frame:
> df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"), A =
> c(11905, 11907, 11907, 11829, 11829, 11829), v1 = c(NA, 3, NA,1,2,NA), v2 =
> c(NA,2,NA, 2, NA,NA), v3 = c(NA,1,NA,1,NA,NA), v4 = c("N", "Y", "N", "Y",
> "N","N"), v5 = c(0,0,0,1,0,0), numMiss=c(3,0,3,0,2,3))
>
> > df
>    id     A v1 v2 v3 v4 v5 numMiss
> 1 id1 11905 NA NA NA  N  0       3
> 2 id1 11907  3  2  1  Y  0       0
> 3 id1 11907 NA NA NA  N  0       3
> 4 id2 11829  1  2  1  Y  1       0
> 5 id2 11829  2 NA NA  N  0       2
> 6 id2 11829 NA NA NA  N  0       3
>
>
> Of the rows that have the same value for "A" within an id, I need to keep
> only the ones with the fewest missing values across all the variables
> (i.e. with min(numMiss)), to get this:
>
>    id     A v1 v2 v3 v4 v5 numMiss
> 1 id1 11905 NA NA NA  N  0       3
> 2 id1 11907  3  2  1  Y  0       0
> 4 id2 11829  1  2  1  Y  1       0
>
> Then, of the rows that have the same id, I have to choose the record with
> the smallest value of "A", like this:
>    id     A v1 v2 v3 v4 v5 numMiss
> 1 id1 11905 NA NA NA  N  0       3
> 4 id2 11829  1  2  1  Y  1       0
>
> For groupings I have used the package "plyr" before, but this would involve
> a sort of double-grouping by id and by duplicated values of A. Could you
> please help me understand how this can be done?
>
> Thank you very much.
> -f
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Choose-between-duplicated-rows-tp4557833p4557833.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
