Re: [R] simplify a dataframe

Arnaud Michel Sun, 14 Jul 2013 11:21:58 -0700

An other remark :
If I calculate
df1$D<- as.numeric(as.Date(paste0(substr(df1$Debut,7,10),"-",
substr(df1$Debut,4,5),"-",substr(df1$Debut,1,2))))
and
df1$F <- as.numeric(as.Date(paste0(substr(df1$Fin,7,10),"-",
substr(df1$Fin,4,5),"-",substr(df1$Fin,1,2))))
and if there is no interruption of time for the lines i and i+1
then df1$F[i] + 1 == df1$D[i+1]
Michel


Le 14/07/2013 18:17, Arnaud Michel a écrit :
> Hi,
> Excuse me for the indistinctness
> Le 13/07/2013 17:18, arun a écrit :
>> Hi,
>> "when the value of Debut of lines i = value Fin of lines i-1"
>> That part is not clear esp. when it is looked upon with the expected output 
>> (df2).
> I want to group the lines which have the same caracteristics (Matricule,
> Nom, Sexe, DateNaissance, Contrat, Pays) and with period of time
> (Debut/start and Fin/end) without interruption of time.
> For exemple :
> The following three lines : Debut/Start  Fin/End
> 1  VERON  Féminin    02/09/1935             CDI commun      France
> *24/01/1995* 30/04/1997
> 1  VERON  Féminin    02/09/1935             CDI commun      France
> 01/05/1997 30/12/1997
> 1  VERON  Féminin    02/09/1935             CDI commun      France
> 31/12/1997 *31/12/1997*
> are transformed into 1 line
> 1  VERON  Féminin    02/09/1935             CDI commun      France
> *24/01/1995* *31/12/1997*
> because same caracteristicsand period of time without interruption of
> time (from *24/01/1995* to *31/12/1997)*
>
> The following six lines :
> 6 BENARD Masculin    01/04/1935             CDI commun Philippines
> *02/02/1995* 27/02/1995
> 6 BENARD Masculin    01/04/1935             CDI commun Philippines
> 28/02/1995 28/02/1995
> 6 BENARD Masculin    01/04/1935             CDI commun Philippines
> 01/03/1995 *12/03/1995*
> 6 BENARD Masculin    01/04/1935             CDI commun      France
> *13/03/1995* *30/06/1995*
> 6 BENARD Masculin    01/04/1935             CDI commun      France
> *01/01/1996* 30/01/1996
> 6 BENARD Masculin    01/04/1935             CDI commun      France
> 31/01/1996 *31/01/1996*
> are transformed into
> 6 BENARD Masculin    01/04/1935             CDI commun Philippines
> *02/02/1995* *12/03/1995*
> 6 BENARD Masculin    01/04/1935             CDI commun      France
> *13/03/1995* *30/06/1995*
> 6 BENARD Masculin    01/04/1935             CDI commun      France
> *01/01/1996* *31/01/1996*
> because
> lines 1-3 identical for caracteristics and without interruption in time
> lines 4 and lines 5-6 are not grouped because there is an interruption
> in time beetween *30/06/1995 *and *01/01/1996*
>
> Thank you for your help
> Michel
>
>>     Also, in your example dataset:
>>
>> df1$contrat[grep("^CDD",df1$contrat)]
>> #[1] "CDD détaché ext. Cirad" "CDD détaché ext. Cirad" "CDD détaché ext. 
>> Cirad"
>> #[4] "CDD détaché ext. Cirad" "CDD détaché ext.Cirad"  "CDD détaché ext. 
>> Cirad"
>> #[7] "CDD détaché ext. Cirad" "CDD détaché ext.Cirad"  "CDD détaché ext. 
>> Cirad"
>> ##Looks like there are extra spaces in some of them.  I guess these are the 
>> same
>> df1$contrat[grep("^CDD",df1$contrat)]<- "CDD détaché ext. Cirad"
>>
>>
>> I tried this:
>> indx<-as.numeric(interaction(df1[,1:6],drop=FALSE))
>>
>>    df1New<- df1
>> res2<-unique(within(df1New,{Debut<-ave(seq_along(indx),indx,FUN=function(x) 
>> Debut[head(x,1)]);Fin<- ave(seq_along(indx),indx,FUN=function(x) 
>> Fin[tail(x,1)])}))
>>    row.names(res2)<- 1:nrow(res2)
>>
>> res2[,c(1,2,7:8)]
>>      Matricule    Nom      Debut        Fin
>> 1          1  VERON 24/01/1995 31/12/1997
>> 2          6 BENARD 02/02/1995 12/03/1995
>> 3          6 BENARD 13/03/1995 31/01/1996 ###here not correct
>> 4          8 DALNIC 24/01/1995 31/08/1995
>> 5          8 DALNIC 01/09/1995 29/02/2000
>> 6        934  FORNI 26/01/1995 31/08/2001
>> 7        934  FORNI 01/09/2001 31/08/2004
>> 8        934  FORNI 01/09/2004 31/08/2007
>> 9        934  FORNI 01/09/2007 04/09/2012
>> 10       934  FORNI 05/09/2012 31/12/4712
>>
>>
>> df2[,c(1,2,7:8)]
>>      Mat    Nom      Debut        Fin
>> 1    1  VERON 24/01/1995 31/12/1997
>> 2    6 BENARD 02/02/1995 12/03/1995
>> 3    6 BENARD 13/03/1995 30/06/1995
>> 4    6 BENARD 01/01/1996 31/01/1996 #missing this row
>> 5    8 DALNIC 24/01/1995 31/08/1995
>> 6    8 DALNIC 01/09/1995 29/02/2000
>> 7  934  FORNI 26/01/1995 31/08/2001
>> 8  934  FORNI 01/09/2001 31/08/2004
>> 9  934  FORNI 01/09/2004 31/08/2007
>> 10 934  FORNI 01/09/2007 04/09/2012
>> 11 934  FORNI 05/09/2012 31/12/4712
>>
>>
>> Here, the dates look similar to the ones on df2 except for one row in df2.
>>
>> A.K.
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Arnaud Michel <michel.arn...@cirad.fr>
>> To: R help <r-help@r-project.org>
>> Cc:
>> Sent: Friday, July 12, 2013 3:45 PM
>> Subject: [R] simplify a dataframe
>>
>> Hello
>>
>> I have the following problem : group the lines of a dataframe when no
>> information change (Matricule, Nom, Sexe, DateNaissance, Contrat, Pays)
>> and when the value of Debut of lines i = value Fin of lines i-1
>> I can obtain it with a do loop. Is it possible to avoid the loop ?
>>
>> The dataframe initial is df1
>> dput(df1)
>> structure(list(Matricule = c(1L, 1L, 1L, 6L, 6L, 6L, 6L, 6L,
>> 6L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 934L, 934L, 934L, 934L,
>> 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L,
>> 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L,
>> 934L, 934L, 934L, 934L), Nom = c("VERON", "VERON", "VERON", "BENARD",
>> "BENARD", "BENARD", "BENARD", "BENARD", "BENARD", "DALNIC", "DALNIC",
>> "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC",
>> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
>> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
>> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
>> "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
>> "FORNI", "FORNI"), Sexe = c("Féminin", "Féminin", "Féminin",
>> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
>> "Féminin", "Féminin", "Féminin", "Féminin", "Féminin", "Féminin",
>> "Féminin", "Féminin", "Féminin", "Masculin", "Masculin", "Masculin",
>> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
>> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
>> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
>> "Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
>> "Masculin", "Masculin", "Masculin"), DateNaissance = c("02/09/1935",
>> "02/09/1935", "02/09/1935", "01/04/1935", "01/04/1935", "01/04/1935",
>> "01/04/1935", "01/04/1935", "01/04/1935", "19/02/1940", "19/02/1940",
>> "19/02/1940", "19/02/1940", "19/02/1940", "19/02/1940", "19/02/1940",
>> "19/02/1940", "19/02/1940", "10/07/1961", "10/07/1961", "10/07/1961",
>> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
>> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
>> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
>> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
>> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
>> "10/07/1961", "10/07/1961"), contrat = c("CDI commun", "CDI commun",
>> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
>> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
>> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
>> "CDI commun", "CDD détaché ext. Cirad", "CDD détaché ext. Cirad",
>> "CDD détaché ext. Cirad", "CDD détaché ext. Cirad", "CDD détaché ext.
>> Cirad",
>> "CDD détaché ext. Cirad", "CDD détaché ext. Cirad", "CDD détaché ext.
>> Cirad",
>> "CDD détaché ext. Cirad", "CDI Détachés Autres", "CDI Détachés Autres",
>> "CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
>> "CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
>> "CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
>> "CDI Détachés Autres", "CDI Détachés Autres", "CDI commun", "CDI commun",
>> "CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
>> "CDI commun"), Pays = c("France", "France", "France", "Philippines",
>> "Philippines", "Philippines", "France", "France", "France", "France",
>> "France", "Martinique", "Martinique", "Martinique", "Martinique",
>> "Martinique", "Martinique", "Martinique", "Cameroun", "Cameroun",
>> "Cameroun", "Cameroun", "Cameroun", "Cameroun", "Cameroun", "Cameroun",
>> "Cameroun", "France", "France", "France", "France", "France",
>> "France", "France", "Congo", "Congo", "Congo", "Congo", "Congo",
>> "Congo", "Gabon", "Gabon", "Gabon", "Gabon", "Gabon", "Gabon",
>> "Congo", "Congo"), Debut = c("24/01/1995", "01/05/1997", "31/12/1997",
>> "02/02/1995", "28/02/1995", "01/03/1995", "13/03/1995", "01/01/1996",
>> "31/01/1996", "24/01/1995", "01/07/1995", "01/09/1995", "01/07/1997",
>> "01/01/1998", "01/08/1998", "01/01/2000", "17/01/2000", "29/02/2000",
>> "26/01/1995", "01/07/1996", "16/09/1997", "01/01/1998", "01/07/1998",
>> "04/11/1999", "01/01/2001", "01/04/2001", "31/08/2001", "01/09/2001",
>> "02/09/2001", "01/12/2001", "01/02/2003", "01/04/2003", "01/01/2004",
>> "01/03/2004", "01/09/2004", "01/01/2005", "01/04/2005", "28/10/2006",
>> "01/01/2007", "01/04/2007", "01/09/2007", "01/01/2009", "01/04/2009",
>> "01/01/2010", "01/01/2011", "01/04/2011", "05/09/2012", "01/01/2013"
>> ), Fin = c("30/04/1997", "30/12/1997", "31/12/1997", "27/02/1995",
>> "28/02/1995", "12/03/1995", "30/06/1995", "30/01/1996", "31/01/1996",
>> "30/06/1995", "31/08/1995", "30/06/1997", "31/12/1997", "31/07/1998",
>> "31/12/1999", "16/01/2000", "28/02/2000", "29/02/2000", "30/06/1996",
>> "15/09/1997", "31/12/1997", "30/06/1998", "03/11/1999", "31/12/2000",
>> "31/03/2001", "30/08/2001", "31/08/2001", "01/09/2001", "30/11/2001",
>> "31/01/2003", "31/03/2003", "31/12/2003", "29/02/2004", "31/08/2004",
>> "31/12/2004", "31/03/2005", "27/10/2006", "31/12/2006", "31/03/2007",
>> "31/08/2007", "31/12/2008", "31/03/2009", "31/12/2009", "31/12/2010",
>> "31/03/2011", "04/09/2012", "31/12/2012", "31/12/4712")), .Names =
>> c("Matricule",
>> "Nom", "Sexe", "DateNaissance", "contrat", "Pays", "Debut", "Fin"
>> ), class = "data.frame", row.names = c(NA, -48L))
>>
>> The dataframe to be obtained is df2
>> dput(df2)
>> structure(list(Mat = c(1L, 6L, 6L, 6L, 8L, 8L, 934L, 934L, 934L,
>> 934L, 934L), Nom = c("VERON", "BENARD", "BENARD", "BENARD", "DALNIC",
>> "DALNIC", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI"), Sexe = c("Féminin",
>> "Masculin", "Masculin", "Masculin", "Féminin", "Féminin", "Masculin",
>> "Masculin", "Masculin", "Masculin", "Masculin"), DateNaissance =
>> c("02/09/1935",
>> "01/04/1935", "01/04/1935", "01/04/1935", "19/02/1940", "19/02/1940",
>> "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961"
>> ), contrat = c("CDI commun", "CDI commun", "CDI commun", "CDI commun",
>> "CDI commun", "CDI commun", "CDD détaché ext. Cirad", "CDI Détachés
>> Autres",
>> "CDI Détachés Autres", "CDI commun", "CDI commun"), Pays = c("France",
>> "Philippines", "France", "France", "France", "Martinique", "Cameroun",
>> "France", "Congo", "Gabon", "Congo"), Debut = c("24/01/1995",
>> "02/02/1995", "13/03/1995", "01/01/1996", "24/01/1995", "01/09/1995",
>> "26/01/1995", "01/09/2001", "01/09/2004", "01/09/2007", "05/09/2012"
>> ), Fin = c("31/12/1997", "12/03/1995", "30/06/1995", "31/01/1996",
>> "31/08/1995", "29/02/2000", "31/08/2001", "31/08/2004", "31/08/2007",
>> "04/09/2012", "31/12/4712")), .Names = c("Mat", "Nom", "Sexe",
>> "DateNaissance", "contrat", "Pays", "Debut", "Fin"), class =
>> "data.frame", row.names = c(NA,
>> -11L))
>>
>> Thank you for your help
>>
>
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Michel ARNAUD
Chargé de mission auprès du DRH
DGDRD-Drh - TA 174/04
Av Agropolis 34398 Montpellier cedex 5
tel : 04.67.61.75.38
fax : 04.67.61.57.87
port: 06.47.43.55.31


        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] simplify a dataframe

Reply via email to