Re: [R] simplify a dataframe

Arnaud Michel Sun, 14 Jul 2013 09:19:58 -0700

Hi,
Excuse me for the indistinctness
Le 13/07/2013 17:18, arun a écrit :
> Hi,
> "when the value of Debut of lines i = value Fin of lines i-1"
> That part is not clear esp. when it is looked upon with the expected output 
> (df2).
I want to group the lines which have the same caracteristics (Matricule, 
Nom, Sexe, DateNaissance, Contrat, Pays) and with period of time 
(Debut/start and Fin/end) without interruption of time.
For exemple :
The following three lines : Debut/Start  Fin/End
1  VERON  Féminin    02/09/1935             CDI commun      France 
*24/01/1995* 30/04/1997
1  VERON  Féminin    02/09/1935             CDI commun      France 
01/05/1997 30/12/1997
1  VERON  Féminin    02/09/1935             CDI commun      France 
31/12/1997 *31/12/1997*
are transformed into 1 line
1  VERON  Féminin    02/09/1935             CDI commun      France 
*24/01/1995* *31/12/1997*
because same caracteristicsand period of time without interruption of 
time (from *24/01/1995* to *31/12/1997)*


The following six lines :
6 BENARD Masculin    01/04/1935             CDI commun Philippines 
*02/02/1995* 27/02/1995
6 BENARD Masculin    01/04/1935             CDI commun Philippines 
28/02/1995 28/02/1995
6 BENARD Masculin    01/04/1935             CDI commun Philippines 
01/03/1995 *12/03/1995*
6 BENARD Masculin    01/04/1935             CDI commun      France 
*13/03/1995* *30/06/1995*
6 BENARD Masculin    01/04/1935             CDI commun      France 
*01/01/1996* 30/01/1996
6 BENARD Masculin    01/04/1935             CDI commun      France 
31/01/1996 *31/01/1996*
are transformed into
6 BENARD Masculin    01/04/1935             CDI commun Philippines 
*02/02/1995* *12/03/1995*
6 BENARD Masculin    01/04/1935             CDI commun      France 
*13/03/1995* *30/06/1995*
6 BENARD Masculin    01/04/1935             CDI commun      France 
*01/01/1996* *31/01/1996*
because
lines 1-3 identical for caracteristics and without interruption in time
lines 4 and lines 5-6 are not grouped because there is an interruption 
in time beetween *30/06/1995 *and *01/01/1996*

Thank you for your help
Michel

>    Also, in your example dataset:
>
> df1$contrat[grep("^CDD",df1$contrat)]
> #[1] "CDD détaché ext. Cirad" "CDD détaché ext. Cirad" "CDD détaché ext. 
> Cirad"
> #[4] "CDD détaché ext. Cirad" "CDD détaché ext.Cirad"  "CDD détaché ext. 
> Cirad"
> #[7] "CDD détaché ext. Cirad" "CDD détaché ext.Cirad"  "CDD détaché ext. 
> Cirad"
> ##Looks like there are extra spaces in some of them.  I guess these are the 
> same
> df1$contrat[grep("^CDD",df1$contrat)]<- "CDD détaché ext. Cirad"
>
>
> I tried this:
> indx<-as.numeric(interaction(df1[,1:6],drop=FALSE))
>
>   df1New<- df1
> res2<-unique(within(df1New,{Debut<-ave(seq_along(indx),indx,FUN=function(x) 
> Debut[head(x,1)]);Fin<- ave(seq_along(indx),indx,FUN=function(x) 
> Fin[tail(x,1)])}))
>   row.names(res2)<- 1:nrow(res2)
>
> res2[,c(1,2,7:8)]
>     Matricule    Nom      Debut        Fin
> 1          1  VERON 24/01/1995 31/12/1997
> 2          6 BENARD 02/02/1995 12/03/1995
> 3          6 BENARD 13/03/1995 31/01/1996 ###here not correct
> 4          8 DALNIC 24/01/1995 31/08/1995
> 5          8 DALNIC 01/09/1995 29/02/2000
> 6        934  FORNI 26/01/1995 31/08/2001
> 7        934  FORNI 01/09/2001 31/08/2004
> 8        934  FORNI 01/09/2004 31/08/2007
> 9        934  FORNI 01/09/2007 04/09/2012
> 10       934  FORNI 05/09/2012 31/12/4712
>
>
> df2[,c(1,2,7:8)]
>     Mat    Nom      Debut        Fin
> 1    1  VERON 24/01/1995 31/12/1997
> 2    6 BENARD 02/02/1995 12/03/1995
> 3    6 BENARD 13/03/1995 30/06/1995
> 4    6 BENARD 01/01/1996 31/01/1996 #missing this row
> 5    8 DALNIC 24/01/1995 31/08/1995
> 6    8 DALNIC 01/09/1995 29/02/2000
> 7  934  FORNI 26/01/1995 31/08/2001
> 8  934  FORNI 01/09/2001 31/08/2004
> 9  934  FORNI 01/09/2004 31/08/2007
> 10 934  FORNI 01/09/2007 04/09/2012
> 11 934  FORNI 05/09/2012 31/12/4712
>
>
> Here, the dates look similar to the ones on df2 except for one row in df2.
>
> A.K.
>
>
>
>
> ----- Original Message -----
> From: Arnaud Michel <michel.arn...@cirad.fr>
> To: R help <r-help@r-project.org>
> Cc:
> Sent: Friday, July 12, 2013 3:45 PM
> Subject: [R] simplify a dataframe
>
> Hello
>
> I have the following problem : group the lines of a dataframe when no
> information change (Matricule, Nom, Sexe, DateNaissance, Contrat, Pays)
> and when the value of Debut of lines i = value Fin of lines i-1
> I can obtain it with a do loop. Is it possible to avoid the loop ?
>
> The dataframe initial is df1
> dput(df1)
> structure(list(Matricule = c(1L, 1L, 1L, 6L, 6L, 6L, 6L, 6L,
> 6L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 934L, 934L, 934L, 934L,
> 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L,
> 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L,
> 934L, 934L, 934L, 934L), Nom = c("VERON", "VERON", "VERON", "BENARD",
> "BENARD", "BENARD", "BENARD", "BENARD", "BENARD", "DALNIC", "DALNIC",
> "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC",
> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
> "FORNI", "FORNI"), Sexe = c("Féminin", "Féminin", "Féminin",
> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
> "Féminin", "Féminin", "Féminin", "Féminin", "Féminin", "Féminin",
> "Féminin", "Féminin", "Féminin", "Masculin", "Masculin", "Masculin",
> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
> "Masculin", "Masculin", "Masculin"), DateNaissance = c("02/09/1935",
> "02/09/1935", "02/09/1935", "01/04/1935", "01/04/1935", "01/04/1935",
> "01/04/1935", "01/04/1935", "01/04/1935", "19/02/1940", "19/02/1940",
> "19/02/1940", "19/02/1940", "19/02/1940", "19/02/1940", "19/02/1940",
> "19/02/1940", "19/02/1940", "10/07/1961", "10/07/1961", "10/07/1961",
> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
> "10/07/1961", "10/07/1961"), contrat = c("CDI commun", "CDI commun",
> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
> "CDI commun", "CDD détaché ext. Cirad", "CDD détaché ext. Cirad",
> "CDD détaché ext. Cirad", "CDD détaché ext. Cirad", "CDD détaché ext.
> Cirad",
> "CDD détaché ext. Cirad", "CDD détaché ext. Cirad", "CDD détaché ext.
> Cirad",
> "CDD détaché ext. Cirad", "CDI Détachés Autres", "CDI Détachés Autres",
> "CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
> "CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
> "CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
> "CDI Détachés Autres", "CDI Détachés Autres", "CDI commun", "CDI commun",
> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
> "CDI commun"), Pays = c("France", "France", "France", "Philippines",
> "Philippines", "Philippines", "France", "France", "France", "France",
> "France", "Martinique", "Martinique", "Martinique", "Martinique",
> "Martinique", "Martinique", "Martinique", "Cameroun", "Cameroun",
> "Cameroun", "Cameroun", "Cameroun", "Cameroun", "Cameroun", "Cameroun",
> "Cameroun", "France", "France", "France", "France", "France",
> "France", "France", "Congo", "Congo", "Congo", "Congo", "Congo",
> "Congo", "Gabon", "Gabon", "Gabon", "Gabon", "Gabon", "Gabon",
> "Congo", "Congo"), Debut = c("24/01/1995", "01/05/1997", "31/12/1997",
> "02/02/1995", "28/02/1995", "01/03/1995", "13/03/1995", "01/01/1996",
> "31/01/1996", "24/01/1995", "01/07/1995", "01/09/1995", "01/07/1997",
> "01/01/1998", "01/08/1998", "01/01/2000", "17/01/2000", "29/02/2000",
> "26/01/1995", "01/07/1996", "16/09/1997", "01/01/1998", "01/07/1998",
> "04/11/1999", "01/01/2001", "01/04/2001", "31/08/2001", "01/09/2001",
> "02/09/2001", "01/12/2001", "01/02/2003", "01/04/2003", "01/01/2004",
> "01/03/2004", "01/09/2004", "01/01/2005", "01/04/2005", "28/10/2006",
> "01/01/2007", "01/04/2007", "01/09/2007", "01/01/2009", "01/04/2009",
> "01/01/2010", "01/01/2011", "01/04/2011", "05/09/2012", "01/01/2013"
> ), Fin = c("30/04/1997", "30/12/1997", "31/12/1997", "27/02/1995",
> "28/02/1995", "12/03/1995", "30/06/1995", "30/01/1996", "31/01/1996",
> "30/06/1995", "31/08/1995", "30/06/1997", "31/12/1997", "31/07/1998",
> "31/12/1999", "16/01/2000", "28/02/2000", "29/02/2000", "30/06/1996",
> "15/09/1997", "31/12/1997", "30/06/1998", "03/11/1999", "31/12/2000",
> "31/03/2001", "30/08/2001", "31/08/2001", "01/09/2001", "30/11/2001",
> "31/01/2003", "31/03/2003", "31/12/2003", "29/02/2004", "31/08/2004",
> "31/12/2004", "31/03/2005", "27/10/2006", "31/12/2006", "31/03/2007",
> "31/08/2007", "31/12/2008", "31/03/2009", "31/12/2009", "31/12/2010",
> "31/03/2011", "04/09/2012", "31/12/2012", "31/12/4712")), .Names =
> c("Matricule",
> "Nom", "Sexe", "DateNaissance", "contrat", "Pays", "Debut", "Fin"
> ), class = "data.frame", row.names = c(NA, -48L))
>
> The dataframe to be obtained is df2
> dput(df2)
> structure(list(Mat = c(1L, 6L, 6L, 6L, 8L, 8L, 934L, 934L, 934L,
> 934L, 934L), Nom = c("VERON", "BENARD", "BENARD", "BENARD", "DALNIC",
> "DALNIC", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI"), Sexe = c("Féminin",
> "Masculin", "Masculin", "Masculin", "Féminin", "Féminin", "Masculin",
> "Masculin", "Masculin", "Masculin", "Masculin"), DateNaissance =
> c("02/09/1935",
> "01/04/1935", "01/04/1935", "01/04/1935", "19/02/1940", "19/02/1940",
> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961"
> ), contrat = c("CDI commun", "CDI commun", "CDI commun", "CDI commun",
> "CDI commun", "CDI commun", "CDD détaché ext. Cirad", "CDI Détachés
> Autres",
> "CDI Détachés Autres", "CDI commun", "CDI commun"), Pays = c("France",
> "Philippines", "France", "France", "France", "Martinique", "Cameroun",
> "France", "Congo", "Gabon", "Congo"), Debut = c("24/01/1995",
> "02/02/1995", "13/03/1995", "01/01/1996", "24/01/1995", "01/09/1995",
> "26/01/1995", "01/09/2001", "01/09/2004", "01/09/2007", "05/09/2012"
> ), Fin = c("31/12/1997", "12/03/1995", "30/06/1995", "31/01/1996",
> "31/08/1995", "29/02/2000", "31/08/2001", "31/08/2004", "31/08/2007",
> "04/09/2012", "31/12/4712")), .Names = c("Mat", "Nom", "Sexe",
> "DateNaissance", "contrat", "Pays", "Debut", "Fin"), class =
> "data.frame", row.names = c(NA,
> -11L))
>
> Thank you for your help
>

-- 
Michel ARNAUD
Chargé de mission auprès du DRH
DGDRD-Drh - TA 174/04
Av Agropolis 34398 Montpellier cedex 5
tel : 04.67.61.75.38
fax : 04.67.61.57.87
port: 06.47.43.55.31


        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] simplify a dataframe

Reply via email to