Re: [R] rbind of multiple data frames by column name, when each data frames can contain different columns

2022-06-03 Thread Avi Gross via R-help
Complexity has all kinds of costs and tradeoffs. In one sense you are doing a 
conceptual merge between related but not identical database tables albeit from 
a set of .CSV files.

You do have some  conceptually direct ways using base R or packages like dplyr 
that  possibly do it in a more complex way outside your view. Hypothetically, 
you could even work sideways and open a connection to a database and store the 
contents of each .CSV files into a relational table and then issue some SQL 
commands asking to do the merge and read them back but that would be weird and 
overkill as R has similar abilities.

But underneath the covers, as shown by the longer code, the R functionality has 
to determine what field/column names are common between pairs of data.frames 
and fill any missing ones with NA and so on, perhaps faster than the code being 
shown and perhaps not. Much depends on the code used and how general it is as 
it is often weighted down by lots of additional functionality you are not 
using, and whether it has big chunks being used that were rewritten in C or C++ 
or other libraries from FORTRAN ...

But unless the amount of data gets huge and it takes too much time or memory, I 
would opt for the conceptually simpler method of using merge() or the 
full_join() in dplyr, or as pointed out, other techniques by changing the 
formats. I can think of other ways such as getting the combined column names 
from all the files, then adding missing ones to each data.frame full of NA as 
needed, then reordering the columns so they all are identical, and finally 
vertically combining all the resulting rows. Again note the internal methods 
used by any merge method (including in SQL) may simply automate tasks.

The more complex methods shown strike me as a tad painful to read, let alone to 
modify when your needs change. Just because some people can program all kinds 
of things in low-level languages like assembler, that does not justify spending 
months on what can be done in days or even minutes using pre-built and 
pre-tested methods layered one upon another.

If there is already an acceptable function that already accepts a list of such 
overlapping data.frames, then other than HW problems, that might be even 
better. It looks fairly easy to construct such a function based on the 
discussion so I wonder if some package already has one.


-Original Message-
From: CALUM POLWART 
To: Bert Gunter 
Cc: r-help@R-project.org ; Stefano Sofia 

Sent: Fri, Jun 3, 2022 3:48 am
Subject: Re: [R] rbind of multiple data frames by column name, when each data 
frames can contain different columns

Bert! It sounds like you are warming to the the tidyverse! ;-)

I completed agree with your analysis this data would be best served long as
you have shown.

If the OP was going to use tidy to manipulate it, they can do the same with
tidyr::pivot_ functions (pivot_longer and pivot_wider). The result will be
the same - but understanding how it happened may be easier!

On Thu, 2 Jun 2022, 21:04 Bert Gunter,  wrote:

> Well, it seems better to me to put all the data frames in long format and
> then rbind them instead of the other way round, which results in the piles
> of NA's you see. I note also, FWIW, that this accords with the so-called
> "tidy" format that many advocate these days. You can always subset (rows)
> and choose by station, date, etc. as needed, of course from the long
> format.
>
> Because of the regularity of your data frames, it is easy to do this. Here
> is a little base R function that "reforms" each data frame (I suspect Rui
> may well provide a more elegant version, though):
>
>
> reform<- function(dat){
>    nm <- names(dat)
>    stanums <- unique(gsub("[^[:digit:]]","", nm[-1])) ## station numbers
> present
>    z <- do.call(rbind, lapply(stanums,
>      \(i)
>      structure(dat[,grep(i, nm, fixed = TRUE)],
>                names = c("Hs", "Hn", "flag"
>    data.frame(POSIX =rep(dat[,1], length(stanums)),
>              Station = rep(stanums, e = nrow(dat)),
>              z)
> }
>
> e.g.
> > reform(df2)
>        POSIX Station Hs Hn flag
> 1  2001-12-01      1 50 20    0
> 2  2001-12-02      1 60 20    0
> 3  2001-12-03      1 70 20    0
> 4  2001-12-04      1 NA NA  NA
> 5  2001-12-05      1 NA NA  NA
> 6  2001-12-01      3 20  0    0
> 7  2001-12-02      3 20  0    0
> 8  2001-12-03      3 30 10    0
> 9  2001-12-04      3 30  0    1
> 10 2001-12-05      3  0  5    0
>
> A call to rbind() of the following form then gives you all your data in
> long form(you may wish to use some shortcuts to form the list of frames):
>
> > do.call(rbind, lapply(list(df1, df2, df3), reform))
>        POSIX Station Hs Hn flag
> 1  2000-12-01      1 30 10    0
> 2  2000-12-02      

Re: [R] rbind of multiple data frames by column name, when each data frames can contain different columns

2022-06-03 Thread John Fox

Dear Stefano,

On 2022-06-03 8:43 a.m., Stefano Sofia wrote:

Thank you to all who provided useful hints, great as always.

In my opinion the solution given by Andrew is perfect, exactly what I wanted to 
do without changing the format of my data frames:


df_list <- list(df1, df2, df3)

allNms <- unique(unlist(lapply(df_list, names)))

do.call(rbind, c(lapply(df_list, function(x) data.frame(x, 
sapply(setdiff(allNms, names(x)), function(y) NA, simplify = FALSE))), 
make.row.names=FALSE))


I'm moderately surprised that you find this simpler than, say,

merge(merge(df1, df2, all=TRUE), df3, all=TRUE)[allNms]




Now I encountered this final problem: when I load my data frames in R through


df1 <- read.table(file="/mypath/df1.csv", header = TRUE, sep=" ", dec = ".", 
stringsAsFactors = TRUE)

df1$data_POSIX <- as.POSIXct(df1$data_POSIX, format = "%Y-%m-%d", 
tz="Etc/GMT-1")


I get the following error:


Error in data.frame(x, sapply(setdiff(allNms, names(x)), function(y) NA,  :
   arguments imply differing number of rows:5, 0


Why? df1 is a data frame correctly filled in.


I can't duplicate this error (and you don't provide the data files). A 
couple of thoughts:


(1) Why set stringsAsFactors = TRUE when you plan to convert the first 
column to dates (though that doesn't account for the error)?


(2) If df1.csv is really a csv file (and why use the extension .csv if 
it isn't?), why specify sep=" " (though if the data file is really a csv 
file, the resulting df1 would be corrupted, and you say it isn't).


Best,
John




Thank you again

Stefano


  (oo)
--oOO--( )--OOo--
Stefano Sofia PhD
Civil Protection - Marche Region - Italy
Meteo Section
Snow Section
Via del Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.so...@regione.marche.it
---Oo-oO


____________
Da: Andrew Simmons 
Inviato: venerd� 3 giugno 2022 11:06
A: Stefano Sofia
Oggetto: Re: [R] rbind of multiple data frames by column name, when each data 
frames can contain different columns

I think I see the problem. I forgot that lapply doesn't assign names where 
sapply does. I think you might want to use that instead, but also supply 
argument simplify as FALSE

sapply(setdiff(allNms, names(x)), function(y) NA, simplify = FALSE)

It should produce something more like

$Station2_Hs
[1] NA

$Station2_Hn
[1] NA

$Station2_flag
[1] NA

which should combine nicely with data.frame(x,)


On Fri, Jun 3, 2022, 03:25 Stefano Sofia 
mailto:stefano.so...@regione.marche.it>> wrote:

Good morning Andrew.

Thank you for your help.

Unfortunately your suggestion does not work, and I got different errors 
depending on the use of the three small examples or real data frames.

If I apply your code to the three small examples I gave you (df1, df2 and df3) 
I get:


Error in match.names(clabs, names(xi)) :
   names do not match previous names


I tried to understand the origin of the problem.


lapply(setdiff(allNms, names(x)), function(y) NA)

gives


[[1]]
[1] NA

[[2]]
[1] NA

[[3]]
[1] NA

[[4]]
[1] NA

[[5]]
[1] NA

[[6]]
[1] NA

[[7]]
[1] NA

[[8]]
[1] NA

[[9]]
[1] NA

[[10]]
[1] NA


and

data.frame(x, lapply(setdiff(allNms, names(x)), function(y) NA))

gives

x NA. NA..1 NA..2 NA..3 NA..4 NA..5 NA..6 NA..7 NA..8 NA..9
1  1  NANANANANANANANANANA
2  1  NANANANANANANANANANA
3  1  NANANANANANANANANANA
4  2  NANANANANANANANANANA
5  2  NANANANANANANANANANA
6  3  NANANANANANANANANANA
7  4  NANANANANANANANANANA
8  4  NANANANANANANANANANA
9  4  NANANANANANANANANANA
10 5  NANANANANANANANANANA
11 6  NANANANANANANANANANA
12 6  NANANANANANANANANANA
13 6  NANANANANANANANANANA


Why?

Unfortunately this code is too difficult for me.

Sorry for bothering you, thank you for what you have already done.

Stefano


  (oo)
--oOO--( )--OOo--
Stefano Sofia PhD
Civil Protection - Marche Region - Italy
Meteo Section
Snow Section
Via del Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.so...@regione.marche.it<mailto:stefano.so...@regione.marche.it>
---Oo-oO


____
Da: Andrew Simmons mailto:akwsi...@gmail.com>>
Inviato: gioved� 2 giugno 2022 08:21
A: Stefano Sofia
Oggetto: Re

[R] rbind of multiple data frames by column name, when each data frames can contain different columns

2022-06-01 Thread Stefano Sofia
Dear R-list users,

for each winter season from 2000 to 2022 I have a data frame collecting for 
different weather stations snowpack height (Hs), snowfall in the last 24h (Hn) 
and a validation flag.

Suppose I have these three following data frames


df1 <- data.frame(data_POSIX=seq(as.POSIXct("2000-12-01", format="%Y-%m-%d", 
tz="Etc/GMT-1"), as.POSIXct("2000-12-05", format="%Y-%m-%d", tz="Etc/GMT-1"), 
by="1 days"), Station1_Hs = c(30, 40, 50, NA, 55), Station1_Hn = c(10, 20, 10, 
NA, 5), Station1_flag = c(0, 0, 0, NA, 0), Station2_Hs = c(20, 20, 30, 30, 0), 
Station2_Hn = c(0, 0, 10, 0, 5), Station2_flag = c(0, 0, 0, 1, 0))


df2 <- data.frame(data_POSIX=seq(as.POSIXct("2001-12-01", format="%Y-%m-%d", 
tz="Etc/GMT-1"), as.POSIXct("2001-12-05", format="%Y-%m-%d", tz="Etc/GMT-1"), 
by="1 days"), Station1_Hs = c(50, 60, 70, NA, NA), Station1_Hn = c(20, 20, 20, 
NA, NA), Station1_flag = c(0, 0, 0, NA, NA), Station3_Hs = c(20, 20, 30, 30, 
0), Station3_Hn = c(0, 0, 10, 0, 5), Station3_flag = c(0, 0, 0, 1, 0))


df3 <- data.frame(data_POSIX=seq(as.POSIXct("2002-12-01", format="%Y-%m-%d", 
tz="Etc/GMT-1"), as.POSIXct("2002-12-05", format="%Y-%m-%d", tz="Etc/GMT-1"), 
by="1 days"), Station2_Hs = c(50, 60, 70, NA, NA), Station2_Hn = c(20, 20, 20, 
NA, NA), Station2_flag = c(0, 0, 0, NA, NA), Station3_Hs = c(20, 20, 30, 30, 
0), Station3_Hn = c(0, 0, 10, 0, 5), Station3_flag = c(0, 0, 0, 1, 0))


As you can see, each data frame can have different stations loaded.

I would need to call rbind matching data frames by column name (i.e. by station 
name), keeping in mind that the number of stations loaded in each data frame 
may differ. The result should be

data_POSIX Station1_Hs Station1_Hn Station1_flag Station2_Hs Station2_Hn 
Station2_flag Station3_Hs Station3_Hn Station3_flag
2000-12-01 30 10 0 20 0 0 NA NA NA
2000-12-02 40 20 0 20 0 0 NA NA NA
2000-12-03 50 10 0 30 10 0 NA NA NA
2000-12-04 NA NA NA 30 0 0 NA NA NA
2000-12-05 55 5 0 0 5 0 NA NA NA
2001-12-01 50 20 0 NA NA NA 20 0 0
2001-12-02 60 20 0 NA NA NA 20 0 0
2001-12-03 70 20 0 NA NA NA 30 10 0
2001-12-04 NA NA NA NA NA NA 30 0 1
2001-12-05 NA NA NA NA NA NA 0 5 0
2002-12-01 NA NA NA 50 20 0 20 0 0
2002-12-02 NA NA NA 60 20 0 20 0 0
2002-12-03 NA NA NA 70 20 0 30 10 0
2002-12-04 NA NA NA NA NA NA 30 0 1
2002-12-05 NA NA NA NA NA NA 0 5 0

I tried this code

df_list <- list(df1, df2, df3)
allNms <- unique(unlist(lapply(df_list, names)))
do.call(rbind, c(lapply(df_list, function(x) data.frame(c(x, 
sapply(setdiff(allNms, names(x)), function(y) NA, make.row.names=FALSE))

but I get this error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = 
TRUE,  :
  arguments imply differing number of rows

Could someone please help me?


Thank you for your attention

Stefano


 (oo)
--oOO--( )--OOo--
Stefano Sofia PhD
Civil Protection - Marche Region - Italy
Meteo Section
Snow Section
Via del Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.so...@regione.marche.it
---Oo-oO



AVVISO IMPORTANTE: Questo messaggio di posta elettronica pu� contenere 
informazioni confidenziali, pertanto � destinato solo a persone autorizzate 
alla ricezione. I messaggi di posta elettronica per i client di Regione Marche 
possono contenere informazioni confidenziali e con privilegi legali. Se non si 
� il destinatario specificato, non leggere, copiare, inoltrare o archiviare 
questo messaggio. Se si � ricevuto questo messaggio per errore, inoltrarlo al 
mittente ed eliminarlo completamente dal sistema del proprio computer. Ai sensi 
dell'art. 6 della DGR n. 1394/2008 si segnala che, in caso di necessit� ed 
urgenza, la risposta al presente messaggio di posta elettronica pu� essere 
visionata da persone estranee al destinatario.
IMPORTANT NOTICE: This e-mail message is intended to be received only by 
persons entitled to receive the confidential information it may contain. E-mail 
messages to clients of Regione Marche may contain information that is 
confidential and legally privileged. Please do not read, copy, forward, or 
store this message unless you are an intended recipient of it. If you have 
received this message in error, please forward it to the sender and delete it 
completely from your computer system.

--
Questo messaggio  stato analizzato da Libraesva ESG ed  risultato non infetto.
This message was scanned by Libraesva ESG and is believed to be clean.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.