Re: [R] take data from a file to another according to their correlation coefficient
Hi Rui, it's me again.

I have another question about the function process.all that you explained to me. But as you have already helped me a lot, and as I promised I wouldn't disturb you again, I want to ask first whether you would agree to help me one more time before I describe my problem more precisely (it is about adding an automatic linear regression in order to have more realistic fill-in data in the gaps).

I wrote you a personal message (I don't know if you got it), because I would like to send you a present from the Alps to thank you for all the help you gave me, and maybe for the new help (so I would need your home or work postal address). If you agree, let me know and send me your address by mail.

I'll explain in a new post what my boss now wants me to add to your function (this function is so tricky for me to understand with my small knowledge + Google + R help).

--
View this message in context: http://r.789695.n4.nabble.com/take-data-from-a-file-to-another-according-to-their-correlation-coefficient-tp4580054p4605385.html
Sent from the R help mailing list archive at Nabble.com.
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] take data from a file to another according to their correlation coefficient
Hello Rui,

For the write.table, it's OK! And the second one (for the 2nd best correlation) seems to work great! You're too strong ^^ I have to check a bit more to be sure, but it seems to do it!

If you come to the Alps, it will be more liqueurs such as Chartreuse or Génépi (made from mountain plants), if you know them. I'll offer you a bottle if you come one day. I could even send it to you in Portugal if you want.

Thanks a lot again for everything.

Geoffrey
Re: [R] take data from a file to another according to their correlation coefficient
Seems to work great! I have a last question (or 2) for you about it, and I will leave you alone afterwards, I promise :)

I tested your function process.all for the automation. It seems to be OK. The only problem is when I'd like to save the filled data files. If I assign the result of process.all, for example:

    test <- process.all(lst, corr2008)

and I save it:

    write.table(test, ...)

and I check the test file, it has filled my data, but all the files from lst are in one file (the columns are: ST001, ST001_time, ST002, ST002_time, ... (with ST001 for station 1, for example)).

How can I cut these files and save them automatically (one file for ST001, another for ST002, ...) according to these column names?

And is it possible in your script to take the data of the second best correlated station instead of the best one, if this best correlated station has NAs at the same lines as the NA gaps of the station to fill?

Thanks again for all your help. If you come one day to France near the Alps or Chamonix (where I'm working), just tell me. I'll buy you some beers or a restaurant meal! You deserve it ^^

By the way, where does my rescuer come from? Are you a statistician?

Geoffrey
Re: [R] take data from a file to another according to their correlation coefficient
Hello,

The first is easy.

> How can I cut these files and save them automatically (one file for ST001,
> another for ST002, ...) according to these columns names?

Similar to the way they were read, using lapply on the results list. But first make a vector of file names. (I've used the file extension 'dat'.)

    test <- process.all(lst, m)

    fl.names <- paste(names(test), "dat", sep=".")
    lapply(seq_len(length(test)), function(i) write.table(test[[i]], fl.names[i], ...))

The second is trickier.

> And it is possible in your script to take the second best correlated station
> data instead of the best one, if there are NAs in this best correlated station
> at the same lines with the NA gaps of the station to fill?

In the function 'process.all', after the internal function 'f', include the following.

    g <- function(station){
        x <- df.list[[station]]
        if(any(is.na(x$data))){
            # exclude the station itself from the ranking
            mat[row(mat) == col(mat)] <- -Inf
            nas <- which(is.na(x$data))
            # drop the best station (already used by 'f') and the station itself (last, -Inf)
            ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
            for(i in nas){
                for(y in ord){
                    # take the first non-missing value, in decreasing order of correlation
                    if(!is.na(df.list[[y]]$data[i])){
                        x$data[i] <- df.list[[y]]$data[i]
                        break
                    }
                }
            }
        }
        x
    }

Then, change the second pass to

    # Note that the two passes are different
    df.list <- lapply(seq.int(n), f)
    df.list <- lapply(seq.int(n), g)

And I come from Portugal. I'm a mathematician (with 6 semesters of stats). When I go to France, it's more to Charente-Maritime - Cognac, I have friends there, and I'll definitely have a couple of cognacs on you.

Good luck with your assignment.

Rui Barradas
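The file-splitting step above leaves the write.table arguments elided. A minimal sketch of one way to fill them in follows; the toy list 'test', the tab separator, the output directory and the other write.table options are assumptions chosen for illustration, not what was used in the thread.

```r
# Hypothetical stand-in for the list returned by process.all():
# one data.frame per station, named after the station.
test <- list(
  ST001 = data.frame(time = c("01/01/2008 00:00", "01/01/2008 00:15"),
                     data = c(1, 2), stringsAsFactors = FALSE),
  ST002 = data.frame(time = c("01/01/2008 00:00", "01/01/2008 00:15"),
                     data = c(8, 9), stringsAsFactors = FALSE)
)

# Assumed output location; replace with the real directory.
out.dir <- tempdir()
fl.names <- file.path(out.dir, paste(names(test), "dat", sep = "."))

# Write one file per station. A tab separator keeps the space inside
# the 'time' field from being read back as a column break.
invisible(Map(function(df, fl)
  write.table(df, fl, sep = "\t", row.names = FALSE, quote = FALSE),
  test, fl.names))
```

Reading a file back with read.table(fl.names[1], header = TRUE, sep = "\t") should then give the same two columns.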
Re: [R] take data from a file to another according to their correlation coefficient
Hi again Rui,

I tested your script as you wrote it with my examples, and it works perfectly! It seems to be exactly what I'm trying to do.

I just have a question about your function na.fill. When I try to apply your script to my data, it doesn't work. I think it's because in your example the data.frames are already loaded in your list. But in my case, these data.frames are in different files (as I have 70 files). I'm trying to apply your function na.fill to the result of list.files. That's why I think it tells me:

    Error in x$data : $ operator is invalid for atomic vectors

I tried x[,2] instead, but it doesn't work either: "incorrect number of dimensions".

How can I do exactly the same with na.fill, but by calling a file (according to the name of the file) and not directly a data.frame like you did (s1, s2, s3)?
Re: [R] take data from a file to another according to their correlation coefficient
Hello,

Try putting the function call in a lapply, along the lines of

    lst <- lapply(list.files(path, pattern), read.table, header=TRUE, stringsAsFactors=FALSE)

You don't strictly need a list for na.fill, but you do need two data.frames, not file names. The list is used by the other functions. (It's also a good idea to have related objects within the same data structure.)

Rui Barradas
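The reading step above can be sketched end to end as follows. The directory, the file names, the 'dat' extension and the tab separator are all assumptions made so that the example runs on its own. full.names = TRUE is used so that the paths resolve whatever the working directory is, and the list is named after the files so that stations can be looked up by name.

```r
# Create two small, hypothetical station files in a temporary directory.
path <- file.path(tempdir(), "stations")
dir.create(path, showWarnings = FALSE)
writeLines(c("time\tdata", "01/01/2008 00:00\t1", "01/01/2008 00:15\tNA"),
           file.path(path, "ST001.dat"))
writeLines(c("time\tdata", "01/01/2008 00:00\t8", "01/01/2008 00:15\t9"),
           file.path(path, "ST002.dat"))

# Read every matching file into a list of data.frames.
files <- list.files(path, pattern = "\\.dat$", full.names = TRUE)
lst <- lapply(files, read.table, header = TRUE, sep = "\t",
              stringsAsFactors = FALSE)

# Name each element after its file, without the extension.
names(lst) <- sub("\\.dat$", "", basename(files))
```

Each element of lst is then a data.frame with 'time' and 'data' columns, which is the kind of object na.fill expects.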
[R] take data from a file to another according to their correlation coefficient
Hi everyone.

I have a question about some work in R I have to do for my job. I have temperature data coming from 70 weather stations. One data file corresponds to one station for one year (so 70 files for one year). Each file looks like this (important: each file contains NAs):

    time              data
    01/01/2008 00:00  -0.25
    01/01/2008 00:15  -0.18
    01/01/2008 00:30  -0.25
    01/01/2008 00:45  -0.25

(one column with date + time every 15 min for the whole year, and one column with data).

I already computed correlation matrices between my weather stations (in order to find the nearest). For example:

              Station1  Station2  Station3  [...]
    Station1  1         0.9       0.8
    Station2  0.9       1         0.7
    Station3  0.8       0.7       1
    [...]

Now, I would like to fill the NA data gaps of a station with data from another station according to their correlation coefficient.

Let's take an example for Station 1: if the station most correlated with Station 1 is Station 2, it has to take data from Station 2 to fill the NA gaps of Station 1, for the same date and hour of course (or the same lines, as I'm doing correlations for the same year).

So for year 2008 (for example), if the correlation is the highest between Station 1 and 2 (among all the stations), and if the data are:

    FOR STATION 1
    time              data
    01/01/2008 00:00  1
    01/01/2008 00:15  2
    01/01/2008 00:30  *NA*
    01/01/2008 00:45  4

and

    FOR STATION 2 (for the same year and the same time)
    time              data
    01/01/2008 00:00  8
    01/01/2008 00:15  9
    01/01/2008 00:30  *10*
    01/01/2008 00:45  11

the Station 1 file should become:

    STATION 1
    time              data
    01/01/2008 00:00  1
    01/01/2008 00:15  2
    01/01/2008 00:30  *10*
    01/01/2008 00:45  4

Hope you've understood what I would like to do :)

Thanks a lot for your ideas and your replies!
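The post does not say how the correlation matrices were computed. One plausible way, sketched here under the assumption that all stations share the same time steps and sit in a named list of data.frames with a 'data' column, is cor() with pairwise complete observations, so that NAs in one station do not discard the whole row. The station values below are made up.

```r
# Made-up stations on a common time grid, with some gaps.
lst <- list(
  Station1 = data.frame(data = c(1, 2, NA, 4, 5)),
  Station2 = data.frame(data = c(8, 9, 10, 11, 12)),
  Station3 = data.frame(data = c(3, 1, 4, NA, 5))
)

# One column per station, one row per time step.
vals <- sapply(lst, function(df) df$data)

# Each pair of stations is correlated on the rows where both have a value.
m <- cor(vals, use = "pairwise.complete.obs")
m
```

The result is a symmetric matrix with station names as row and column names, the same shape as the example matrix in the post.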
Re: [R] take data from a file to another according to their correlation coefficient
Hi,

Even your example should show why this is a bad way to fill in missing weather data: you end up with a sequence for Station 1 of 1, 2, 10, 4 even though that's certainly wrong, because Station 2 is reliably 7 units above Station 1. Correlated doesn't mean identical.

There are other, better options. If you're only missing a single value, interpolation between the values you do have for that station is likely better. If you're missing lots, regression of that station on another correlated station would be the more reasonable way to do what you're proposing here.

But in fact interpolation of weather data is very complicated, and the subject of a lot of research. The most realistic methods use elevation as a covariate. These may well be overkill for your situation, though, unless you are missing whole days of data.

Sarah

On Apr 23, 2012, at 6:42 AM, jeff6868 geoffrey_kl...@etu.u-bourgogne.fr wrote:

> Hi everyone.
>
> I have a question about a work on R I have to do for my job. I have
> temperature data coming from 70 weather stations. One data file corresponds
> to one station for one year (so 70 files for one year). Each file looks like
> this (important: each file contains NAs):
>
>     time              data
>     01/01/2008 00:00  -0.25
>     01/01/2008 00:15  -0.18
>     01/01/2008 00:30  -0.25
>     01/01/2008 00:45  -0.25
>
> (one column with date + time every 15mn for the whole year, and one column
> with data).
>
> I already did correlation matrices between my weather stations (in order to
> find the nearest). For example:
>
>               Station1  Station2  Station3  [...]
>     Station1  1         0.9       0.8
>     Station2  0.9       1         0.7
>     Station3  0.8       0.7       1
>     [...]
>
> Now, I would like to fill the NA data gaps of a station with data from
> another station according to their correlation coefficient.
>
> Let's take an example for the Station 1: if the most correlated Station with
> Station 1 is Station 2, it has to take data from Station 2 to fill NA gaps of
> Station 1, for the same date and hour of course (or same lines as I'm doing
> correlations for the same year).
>
> So for year 2008 (for example), if the correlation is the highest between
> Station 1 and 2 (according to all the Stations), and if the data are:
>
>     FOR STATION 1
>     time              data
>     01/01/2008 00:00  1
>     01/01/2008 00:15  2
>     01/01/2008 00:30  *NA*
>     01/01/2008 00:45  4
>
> and
>
>     FOR STATION 2 (for the same year and the same time)
>     time              data
>     01/01/2008 00:00  8
>     01/01/2008 00:15  9
>     01/01/2008 00:30  *10*
>     01/01/2008 00:45  11
>
> The Station1 file should become:
>
>     STATION 1
>     time              data
>     01/01/2008 00:00  1
>     01/01/2008 00:15  2
>     01/01/2008 00:30  *10*
>     01/01/2008 00:45  4
>
> Hope you've understood what I would like to do :)
>
> Thanks a lot for your ideas and your replies!
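The regression alternative mentioned in the reply can be sketched with lm() and predict(). This is only an illustration on the toy numbers from the question, assuming a simple linear relation between the two stations; the data.frames s1 and s2 and the column names x and y are introduced here for the example.

```r
# Toy stations from the question: Station 2 is always 7 units above Station 1.
s1 <- data.frame(time = c("01/01/2008 00:00", "01/01/2008 00:15",
                          "01/01/2008 00:30", "01/01/2008 00:45"),
                 data = c(1, 2, NA, 4), stringsAsFactors = FALSE)
s2 <- data.frame(time = s1$time, data = c(8, 9, 10, 11),
                 stringsAsFactors = FALSE)

# Regress the station with gaps on the correlated station.
# lm() drops the rows with NA by default.
fit <- lm(y ~ x, data = data.frame(y = s1$data, x = s2$data))

# Predict only where s1 is missing and s2 has a value.
gap <- is.na(s1$data) & !is.na(s2$data)
s1$data[gap] <- as.numeric(predict(fit, newdata = data.frame(x = s2$data[gap])))
s1
```

On these numbers the gap is filled with 3 rather than 10, because the fitted line accounts for the constant offset between the stations.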
Re: [R] take data from a file to another according to their correlation coefficient
Hi Sarah,

Thank you for your answer. Yes, I know that my proposal is not necessarily the best way to do it. But my problem concerns only big gaps of course (from more than half a day up to several months of missing data). I've already filled the small gaps with the interpolation that you were talking about in your message (with the function na.approx of the package zoo).

For the study, it's not important to have perfectly identical values between the 2 correlated stations, because after the reconstruction I'll calculate the daily mean of each station. For my boss, it's enough to work on daily means. But before that, I need to rebuild the big missing data gaps of my stations (in the way I explained in the first message of my topic).

Do you have any idea how to do it in R according to my first post?

I forgot to mention that my examples are completely made up! I chose these numbers so that you could understand what I want to do (I chose easy and readable numbers). I tested in Excel with 2 stations, and it was not too bad when I filled the gaps (between the data of the 2 well correlated stations).
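The daily means mentioned above can be computed in base R once the 'time' column is converted to a date. The sketch below assumes the day/month/year layout shown in the first post and uses made-up values; the data.frame 's' and its numbers are hypothetical.

```r
# Made-up station covering two days, with one missing value.
s <- data.frame(
  time = c("01/01/2008 00:00", "01/01/2008 12:00",
           "02/01/2008 00:00", "02/01/2008 12:00"),
  data = c(1, 3, NA, 5),
  stringsAsFactors = FALSE
)

# Keep only the date part; the format follows the dd/mm/yyyy layout of the files.
day <- as.Date(s$time, format = "%d/%m/%Y")

# Mean per day. The formula interface drops rows with NA before averaging.
daily <- aggregate(data ~ day, data = data.frame(day = day, data = s$data),
                   FUN = mean)
daily
```

A day whose values are all NA is dropped from the result by this approach, so any long gaps need to be filled before this step.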
Re: [R] take data from a file to another according to their correlation coefficient
Hi Rui,

Yes, you're right. It's me again ^^ This post is the last part (I hope) of my job. You helped me a lot last time with the correlation matrices.

I have to leave work now, so I'll check and test your proposal tomorrow. But there is no doubt that it will help me a lot again. I'll tell you tomorrow.

Thanks Rui!
Re: [R] take data from a file to another according to their correlation coefficient
Hello,

jeff6868 wrote
> Hi Sarah,
>
> Thank you for your answer. Yes, I know that my proposal is not necessarily
> the best way to do it. But my problem concerns only big gaps of course (from
> more than half a day up to several months of missing data). I've already
> filled the small gaps with the interpolation that you were talking about in
> your message (with the function na.approx of the package zoo).
>
> For the study, it's not important to have perfectly identical values between
> the 2 correlated stations, because after the reconstruction I'll calculate
> the daily mean of each station. For my boss, it's enough to work on daily
> means. But before that, I need to rebuild the big missing data gaps of my
> stations (in the way I explained in the first message of my topic).
>
> Do you have any idea how to do it in R according to my first post?
>
> I forgot to mention that my examples are completely made up! I chose these
> numbers so that you could understand what I want to do (I chose easy and
> readable numbers). I tested in Excel with 2 stations, and it was not too bad
> when I filled the gaps (between the data of the 2 well correlated stations).

I remember this data set from some time ago. (Weeks?)

First of all, please use ?dput to post your data; it makes it much easier for everyone to just copy and paste into an R session. The output you should post looks like this:

    dput(s1)
    structure(list(time = c("01/01/2008 00:00", "01/01/2008 00:15",
    "01/01/2008 00:30", "01/01/2008 00:45"), data = c(1L, 2L, NA, 4L)),
    .Names = c("time", "data"), row.names = c(NA, -4L), class = "data.frame")

    dput(s2)
    structure(list(time = c("01/01/2008 00:00", "01/01/2008 00:15",
    "01/01/2008 00:30", "01/01/2008 00:45"), data = 8:11),
    .Names = c("time", "data"), row.names = c(NA, -4L), class = "data.frame")

    dput(s3)
    structure(list(time = c("01/01/2008 00:00", "01/01/2008 00:15",
    "01/01/2008 00:30", "01/01/2008 00:45"), data = c(123L, NA, NA, NA)),
    .Names = c("time", "data"), row.names = c(NA, -4L), class = "data.frame")

    dput(m)
    structure(c(1, 0.9, 0.8, 0.9, 1, 0.7, 0.8, 0.7, 1), .Dim = c(3L, 3L),
    .Dimnames = list(c("Station1", "Station2", "Station3"),
    c("Station1", "Station2", "Station3")))

I've named your data.frames 's1', 's2' and made up an 's3'; 'm' is the correlation matrix.

Now the problem. Sarah's comment seems sensible: just filling in missing values using some other dataset isn't very canonical, but here it goes. It assumes the data frames are in a list.

    lst <- list(s1, s2, s3)
    names(lst) <- paste("Station", seq.int(length(lst)), sep="")
    lst

    # station - list number or name, not the data.frame
    # mat - correlation matrix
    get.max.cor <- function(station, mat){
        mat[row(mat) == col(mat)] <- -Inf
        which( mat[station, ] == max(mat[station, ]) )
    }

    # x - data.frame to be transformed
    # y - data.frame with greater correlation
    na.fill <- function(x, y){
        i <- is.na(x$data)
        x$data[i] <- y$data[i]
        x
    }

    mx.cor <- get.max.cor(1, m)
    mx.cor
    na.fill(lst[[1]], lst[[mx.cor]])

As said in the comments before the function, the call to the first function could also be

    get.max.cor("Station1", m)

The two functions above solve the problem; all that's left to do is to automate their calls. Note that there might be a need for two passes through 'na.fill', if the data.frame with greater correlation also has NAs. This is the case of Station1 filling in values for Station3. Try commenting out the second pass in the function below.

    process.all <- function(df.list, mat){
        f <- function(station)
            na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
        #
        n <- length(df.list)
        nms <- names(df.list)
        # First the max on each row
        # (uses the argument 'mat', not the global 'm')
        max.cor <- sapply(seq.int(n), get.max.cor, mat)
        # Note the two passes
        df.list <- lapply(seq.int(n), f)
        df.list <- lapply(seq.int(n), f)
        # Makes nicer output
        names(df.list) <- nms
        df.list
    }

    process.all(lst, m)

Hope this helps,

Rui Barradas