Hi, suppose that I have the following data.frame: cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y 24996 10020470 1 1 2 12 16 21 17 51 43 19 183 24996 10020470 69 91 79 92 91 77 90 96 98 108 891 36145 10020470 0 0 0 0 2 83 112 97 91 144 529 44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110
I would like to create a new variable X that indicates which line, within the cnpj variable, has the highest value Y. For instance, within the cnpj = 10020470, the second line has the largest value Y (891). For cnpj = 10023333 is trivial (1110). Then, my new data.frame would become: cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y X 24996 10020470 1 1 2 12 16 21 17 51 43 19 183 FALSE 24996 10020470 69 91 79 92 91 77 90 96 98 108 891 TRUE 36145 10020470 0 0 0 0 2 83 112 97 91 144 529 FALSE 44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110 TRUE Notice that for every value of the variable cnpj, only one line will have X = TRUE. Then, I would like to create a variable Z that is the sum of variable Y, also by variable cnpj. Thus, if cnpj = 10020470, Z = 183 + 891 +529 and for cnpj = 10023333, Z = 120. These sums can easily be done with tapply or aggregate but those would eliminate line with equal cnpj and I donât want that. I would like to achieve a data.frame like the following: cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y X Z 24996 10020470 1 1 2 12 16 21 17 51 43 19 183 FALSE 1603 24996 10020470 69 91 79 92 91 77 90 96 98 108 891 TRUE 1603 36145 10020470 0 0 0 0 2 83 112 97 91 144 529 FALSE 1603 44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110 TRUE 1110 In the end I will eliminate all lines with X = FALSE. Thank you and sorry for the long question. Mateus Rabello [[alternative HTML version deleted]]
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.