Re: [R] Calculate daily means from 5-minute interval data
This problem nearly always boils down to using meta-knowledge about the file. Having informal TZ info in the file is very helpful, but "PST" is not necessarily a uniquely-defined time zone specification, so you have to draw on information outside of the file to know that these codes correspond to -0800 etc. (e.g., "CST" could be China Standard Time or US Central Standard Time.) Thus, it is tough to make this into a broadly-useful function.

You can also construct the timezone column from knowledge about the location of interest and the monotonicity of the time data:

https://jdnewmil.github.io/eci298sp2016/QuickHowtos1.html#handling-time-data

... but the answer to "easy" seems firmly in the eyes of the beholder.

On September 5, 2021 10:18:48 AM PDT, Bill Dunlap wrote:
>What is the best way to read (from a text file) timestamps from the fall
>time change, where there are two 1:15am's? E.g., here is an extract from a
>US Geological Survey web site giving data on the river through our county
>on 2020-11-01, when we changed from PDT to PST:
>https://nwis.waterdata.usgs.gov/wa/nwis/uv/?cb_00010=on_00060=on_00065=on=rdb_no=12200500=_date=2020-11-01_date=2020-11-05
>
>The timestamps include the date and time as well as PDT or PST.
>river <-
>c("datetime,tz,discharge,height,temp",
>  "2020-11-01 00:00,PDT,20500,16.44,9.3",
>  "2020-11-01 00:15,PDT,20500,16.44,9.3",
>  "2020-11-01 00:30,PDT,20500,16.43,9.3",
>  "2020-11-01 00:45,PDT,20400,16.40,9.3",
>  "2020-11-01 01:00,PDT,20400,16.40,9.3",
>  "2020-11-01 01:00,PST,20200,16.34,9.2",
>  "2020-11-01 01:15,PDT,20400,16.39,9.3",
>  "2020-11-01 01:15,PST,20200,16.34,9.2",
>  "2020-11-01 01:30,PDT,20300,16.37,9.2",
>  "2020-11-01 01:30,PST,20100,16.31,9.2",
>  "2020-11-01 01:45,PDT,20300,16.35,9.2",
>  "2020-11-01 01:45,PST,20100,16.29,9.2",
>  "2020-11-01 02:00,PST,20100,16.29,9.2",
>  "2020-11-01 02:15,PST,2,16.27,9.1",
>  "2020-11-01 02:30,PST,2,16.26,9.1"
>  )
>d <- read.table(text=river, sep=",", header=TRUE)
>
>The entries are obviously not in time order.
>
>Is there a simple way to read the datetime and tz columns together? One
>way is to use d$tz to construct an offset that can be read with
>strptime's "%z".
>
>> d$POSIXct <- as.POSIXct(paste(d$datetime,
>+                               ifelse(d$tz=="PDT","-0700","-0800")),
>+                         format="%Y-%m-%d %H:%M %z")
>> d
>           datetime  tz discharge height temp             POSIXct
>1  2020-11-01 00:00 PDT     20500  16.44  9.3 2020-11-01 00:00:00
>2  2020-11-01 00:15 PDT     20500  16.44  9.3 2020-11-01 00:15:00
>3  2020-11-01 00:30 PDT     20500  16.43  9.3 2020-11-01 00:30:00
>4  2020-11-01 00:45 PDT     20400  16.40  9.3 2020-11-01 00:45:00
>5  2020-11-01 01:00 PDT     20400  16.40  9.3 2020-11-01 01:00:00
>6  2020-11-01 01:00 PST     20200  16.34  9.2 2020-11-01 01:00:00
>7  2020-11-01 01:15 PDT     20400  16.39  9.3 2020-11-01 01:15:00
>8  2020-11-01 01:15 PST     20200  16.34  9.2 2020-11-01 01:15:00
>9  2020-11-01 01:30 PDT     20300  16.37  9.2 2020-11-01 01:30:00
>10 2020-11-01 01:30 PST     20100  16.31  9.2 2020-11-01 01:30:00
>11 2020-11-01 01:45 PDT     20300  16.35  9.2 2020-11-01 01:45:00
>12 2020-11-01 01:45 PST     20100  16.29  9.2 2020-11-01 01:45:00
>13 2020-11-01 02:00 PST     20100  16.29  9.2 2020-11-01 02:00:00
>14 2020-11-01 02:15 PST         2  16.27  9.1 2020-11-01 02:15:00
>15 2020-11-01 02:30 PST         2  16.26  9.1 2020-11-01 02:30:00
>> with(d[order(d$POSIXct),], plot(temp))  # monotonic temperature
>
>-Bill
>
>On Thu, Sep 2, 2021 at 12:41 PM Jeff Newmiller wrote:
>
>> Regardless of whether you use the lower-level split function, or the
>> higher-level aggregate function, or the tidyverse group_by function, the
>> key is learning how to create the column that is the same for all records
>> corresponding to the time interval of interest.
>>
>> If you convert the sampdate to POSIXct, the tz IS important, because most
>> of us use local timezones that respect daylight savings time, and a naive
>> conversion of standard time will run into trouble if R is assuming
>> daylight savings time applies. The lubridate package gets around this by
>> always assuming UTC and giving you a function to "fix" the timezone after
>> the conversion. I prefer to always be specific about timezones, at least
>> by using something like
>>
>> Sys.setenv( TZ = "Etc/GMT+8" )
>>
>> which does not respect daylight savings.
>>
>> Regarding using character data for identifying the month, in order to
>> have clean plots of the data I prefer to use the trunc function but it
>> returns a POSIXlt so I convert it to POSIXct:
>>
>> discharge$sampmonthbegin <- as.POSIXct( trunc( discharge$sampdate,
>>     units = "months" ) )
>>
>> Then any of various ways can be used to aggregate the records by that
>> column.
>>
>> On September 2, 2021 12:10:15 PM PDT, Andrew Simmons wrote:
>> >You could use 'split' to create a list of data frames, and then apply a
>> >function to each to get the means and sds.
>> >
>> >cols <- "cfs"
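Once the two-offset timestamps are parsed with the "%z" trick, collapsing the records to daily means is mechanical. A minimal sketch in base R (assuming the `d` data frame from Bill's example, with its `POSIXct` column already built as shown; the fixed-offset zone `Etc/GMT+8` avoids any DST ambiguity when assigning a calendar day):

```r
# Reduce each timestamp to a calendar day in a fixed-offset zone (no DST),
# then average the numeric columns within each day.
d$day <- as.Date(d$POSIXct, tz = "Etc/GMT+8")
daily <- aggregate(cbind(discharge, height, temp) ~ day, data = d, FUN = mean)
daily
```

The same pattern works for any interval: replace `as.Date()` with a coarser truncation to group by month or year.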
Re: [R] Calculate daily means from 5-minute interval data
What is the best way to read (from a text file) timestamps from the fall
time change, where there are two 1:15am's? E.g., here is an extract from a
US Geological Survey web site giving data on the river through our county
on 2020-11-01, when we changed from PDT to PST:

https://nwis.waterdata.usgs.gov/wa/nwis/uv/?cb_00010=on_00060=on_00065=on=rdb_no=12200500=_date=2020-11-01_date=2020-11-05

The timestamps include the date and time as well as PDT or PST.

river <- c("datetime,tz,discharge,height,temp",
           "2020-11-01 00:00,PDT,20500,16.44,9.3",
           "2020-11-01 00:15,PDT,20500,16.44,9.3",
           "2020-11-01 00:30,PDT,20500,16.43,9.3",
           "2020-11-01 00:45,PDT,20400,16.40,9.3",
           "2020-11-01 01:00,PDT,20400,16.40,9.3",
           "2020-11-01 01:00,PST,20200,16.34,9.2",
           "2020-11-01 01:15,PDT,20400,16.39,9.3",
           "2020-11-01 01:15,PST,20200,16.34,9.2",
           "2020-11-01 01:30,PDT,20300,16.37,9.2",
           "2020-11-01 01:30,PST,20100,16.31,9.2",
           "2020-11-01 01:45,PDT,20300,16.35,9.2",
           "2020-11-01 01:45,PST,20100,16.29,9.2",
           "2020-11-01 02:00,PST,20100,16.29,9.2",
           "2020-11-01 02:15,PST,2,16.27,9.1",
           "2020-11-01 02:30,PST,2,16.26,9.1"
           )
d <- read.table(text=river, sep=",", header=TRUE)

The entries are obviously not in time order.

Is there a simple way to read the datetime and tz columns together? One way
is to use d$tz to construct an offset that can be read with strptime's "%z".
> d$POSIXct <- as.POSIXct(paste(d$datetime,
+                               ifelse(d$tz=="PDT","-0700","-0800")),
+                         format="%Y-%m-%d %H:%M %z")
> d
           datetime  tz discharge height temp             POSIXct
1  2020-11-01 00:00 PDT     20500  16.44  9.3 2020-11-01 00:00:00
2  2020-11-01 00:15 PDT     20500  16.44  9.3 2020-11-01 00:15:00
3  2020-11-01 00:30 PDT     20500  16.43  9.3 2020-11-01 00:30:00
4  2020-11-01 00:45 PDT     20400  16.40  9.3 2020-11-01 00:45:00
5  2020-11-01 01:00 PDT     20400  16.40  9.3 2020-11-01 01:00:00
6  2020-11-01 01:00 PST     20200  16.34  9.2 2020-11-01 01:00:00
7  2020-11-01 01:15 PDT     20400  16.39  9.3 2020-11-01 01:15:00
8  2020-11-01 01:15 PST     20200  16.34  9.2 2020-11-01 01:15:00
9  2020-11-01 01:30 PDT     20300  16.37  9.2 2020-11-01 01:30:00
10 2020-11-01 01:30 PST     20100  16.31  9.2 2020-11-01 01:30:00
11 2020-11-01 01:45 PDT     20300  16.35  9.2 2020-11-01 01:45:00
12 2020-11-01 01:45 PST     20100  16.29  9.2 2020-11-01 01:45:00
13 2020-11-01 02:00 PST     20100  16.29  9.2 2020-11-01 02:00:00
14 2020-11-01 02:15 PST         2  16.27  9.1 2020-11-01 02:15:00
15 2020-11-01 02:30 PST         2  16.26  9.1 2020-11-01 02:30:00
> with(d[order(d$POSIXct),], plot(temp))  # monotonic temperature

-Bill

On Thu, Sep 2, 2021 at 12:41 PM Jeff Newmiller wrote:

> Regardless of whether you use the lower-level split function, or the
> higher-level aggregate function, or the tidyverse group_by function, the
> key is learning how to create the column that is the same for all records
> corresponding to the time interval of interest.
>
> If you convert the sampdate to POSIXct, the tz IS important, because most
> of us use local timezones that respect daylight savings time, and a naive
> conversion of standard time will run into trouble if R is assuming
> daylight savings time applies. The lubridate package gets around this by
> always assuming UTC and giving you a function to "fix" the timezone after
> the conversion. I prefer to always be specific about timezones, at least
> by using something like
>
> Sys.setenv( TZ = "Etc/GMT+8" )
>
> which does not respect daylight savings.
> Regarding using character data for identifying the month, in order to
> have clean plots of the data I prefer to use the trunc function but it
> returns a POSIXlt so I convert it to POSIXct:
>
> discharge$sampmonthbegin <- as.POSIXct( trunc( discharge$sampdate,
>     units = "months" ) )
>
> Then any of various ways can be used to aggregate the records by that
> column.
>
> On September 2, 2021 12:10:15 PM PDT, Andrew Simmons wrote:
> >You could use 'split' to create a list of data frames, and then apply a
> >function to each to get the means and sds.
> >
> >cols <- "cfs" # add more as necessary
> >S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
> >means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
> >sds <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm =
> >TRUE)))
> >
> >On Thu, Sep 2, 2021 at 3:01 PM Rich Shepard wrote:
> >
> >> On Thu, 2 Sep 2021, Rich Shepard wrote:
> >>
> >> > If I correctly understand the output of as.POSIXlt each date and time
> >> > element is separate, so input such as 2016-03-03 12:00 would now be
> >> > 2016 03 03 12 00 (I've not read how the elements are separated). (The
> >> > TZ is not important because all data are either PST or PDT.)
> >>
> >> Using this script:
> >> discharge <- read.csv('../data/water/discharge.dat', header = TRUE,
> >>                       sep = ',', stringsAsFactors = FALSE)
> >> discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz
Re: [R] Calculate daily means from 5-minute interval data
On Fri, 3 Sep 2021, Jeff Newmiller wrote:

> The fact that your projects are in a single time zone is irrelevant. I am
> not sure how you can be so confident in saying it does not matter whether
> the data were recorded in PDT or PST, since if it were recorded in PDT
> then there would be a day in March with 23 hours and another day in
> November with 25 hours, but if it were recorded in PST then there would
> always be 24 hours in every day, and R almost always assumes daylight
> savings if you don't tell it otherwise!

Got it, Jeff. Thanks very much.

Regards,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Calculate daily means from 5-minute interval data
On Fri, 3 Sep 2021, Rich Shepard wrote:

> On Thu, 2 Sep 2021, Jeff Newmiller wrote:
>
>> Regardless of whether you use the lower-level split function, or the
>> higher-level aggregate function, or the tidyverse group_by function,
>> the key is learning how to create the column that is the same for all
>> records corresponding to the time interval of interest.
>
> Jeff, I definitely agree with the above.
>
>> If you convert the sampdate to POSIXct, the tz IS important, because
>> most of us use local timezones that respect daylight savings time, and
>> a naive conversion of standard time will run into trouble if R is
>> assuming daylight savings time applies. The lubridate package gets
>> around this by always assuming UTC and giving you a function to "fix"
>> the timezone after the conversion. I prefer to always be specific about
>> timezones, at least by using something like
>>
>> Sys.setenv( TZ = "Etc/GMT+8" )
>>
>> which does not respect daylight savings.
>
> I'm not following you here. All my projects have always been in a single
> time zone and the data might be recorded at June 19th or November 4th
> but do not depend on whether the time is PDT or PST. My hosts all set
> the hardware clock to local time, not UTC.

The fact that your projects are in a single time zone is irrelevant. I am
not sure how you can be so confident in saying it does not matter whether
the data were recorded in PDT or PST, since if it were recorded in PDT
then there would be a day in March with 23 hours and another day in
November with 25 hours, but if it were recorded in PST then there would
always be 24 hours in every day, and R almost always assumes daylight
savings if you don't tell it otherwise!

I am also normally working with automated collection devices that record
data in standard time year round. But if you fail to tell R that this is
the case, then it will almost always assume your data are stored with
daylight savings time and screw up the conversion to computable time
format.
This screw-up may include NA values in spring, when standard time has
perfectly valid times between 1am and 2am on the changeover day, but in
daylight time those timestamps would be invalid and will end up as NA
values in your timestamp column.

> As the location(s) at which data are collected remain fixed
> geographically I don't understand why daylight savings time, or
> non-daylight savings time, is important.

I am telling you that it is important _TO R_ if you use POSIXt times.
Acknowledge this and move on with life, or avoid POSIXt data. As I said,
one way to acknowledge this while limiting the amount of attention you
have to give to the problem is to use UTC/GMT everywhere... but this can
lead to weird time-of-day problems as I pointed out in my timestamp
cleaning slides:

https://jdnewmil.github.io/time-2018-10/TimestampCleaning.html

If you want to use GMT everywhere... then you have to use GMT explicitly,
because the default timezone in R is practically never GMT for most
people. You. Need. To. Be. Explicit. Don't fight it. Just do it. It isn't
hard.

>> Regarding using character data for identifying the month, in order to
>> have clean plots of the data I prefer to use the trunc function but it
>> returns a POSIXlt so I convert it to POSIXct:
>
> I don't use character data for months, as far as I know. If a sample
> date is, for example, 2021-09-03 then monthly summaries are based on
> '09', not 'September.'

You are taking this out of context and complaining that it has no
context. This was a reply to a response by Andrew Simmons in which he
used the "format" function to create unique year/month strings to act as
group-by data. Earlier, when I originally responded to clarify how you
could use the dplyr group_by function, I used your character date column
without combining it with time or converting to Date at all.
If you studied these responses more carefully you would indeed have been
using character data for grouping in some cases, and my only point was
that doing so can indeed be a shortcut to the immediate answer while
being troublesome later in the analysis. Accusing you of mishandling data
was not my intention.

> I've always valued your inputs to help me understand what I don't. In
> this case I'm really lost in understanding your position.

I hope my comments are clear enough now.

> Have a good Labor Day weekend,

Thanks! (Not relevant to many on this list.)

--
Jeff Newmiller, Research Engineer (Solar/Batteries/Software/Embedded
Controllers)
Re: [R] Calculate daily means from 5-minute interval data
On Thu, 2 Sep 2021, Jeff Newmiller wrote:

> Regardless of whether you use the lower-level split function, or the
> higher-level aggregate function, or the tidyverse group_by function, the
> key is learning how to create the column that is the same for all
> records corresponding to the time interval of interest.

Jeff, I definitely agree with the above.

> If you convert the sampdate to POSIXct, the tz IS important, because
> most of us use local timezones that respect daylight savings time, and a
> naive conversion of standard time will run into trouble if R is assuming
> daylight savings time applies. The lubridate package gets around this by
> always assuming UTC and giving you a function to "fix" the timezone
> after the conversion. I prefer to always be specific about timezones, at
> least by using something like
>
> Sys.setenv( TZ = "Etc/GMT+8" )
>
> which does not respect daylight savings.

I'm not following you here. All my projects have always been in a single
time zone and the data might be recorded at June 19th or November 4th but
do not depend on whether the time is PDT or PST. My hosts all set the
hardware clock to local time, not UTC.

As the location(s) at which data are collected remain fixed geographically
I don't understand why daylight savings time, or non-daylight savings
time, is important.

> Regarding using character data for identifying the month, in order to
> have clean plots of the data I prefer to use the trunc function but it
> returns a POSIXlt so I convert it to POSIXct:

I don't use character data for months, as far as I know. If a sample date
is, for example, 2021-09-03 then monthly summaries are based on '09', not
'September.'

I've always valued your inputs to help me understand what I don't. In
this case I'm really lost in understanding your position.
Have a good Labor Day weekend,

Rich
Re: [R] Calculate daily means from 5-minute interval data
On Thu, 2 Sep 2021, Jeff Newmiller wrote:

> Regardless of whether you use the lower-level split function, or the
> higher-level aggregate function, or the tidyverse group_by function, the
> key is learning how to create the column that is the same for all
> records corresponding to the time interval of interest.

Jeff,

I tried responding to only you but my message bounced:

: host d9300a.ess.barracudanetworks.com[209.222.82.252] said: 550
permanent failure for one or more recipients
(jdnew...@dcn.davis.ca.us:blocked) (in reply to end of DATA command)

My response was not pertinent to the entire list, IMO, so I sent it to
your address.

Rich
Re: [R] Calculate daily means from 5-minute interval data
Regardless of whether you use the lower-level split function, or the
higher-level aggregate function, or the tidyverse group_by function, the
key is learning how to create the column that is the same for all records
corresponding to the time interval of interest.

If you convert the sampdate to POSIXct, the tz IS important, because most
of us use local timezones that respect daylight savings time, and a naive
conversion of standard time will run into trouble if R is assuming
daylight savings time applies. The lubridate package gets around this by
always assuming UTC and giving you a function to "fix" the timezone after
the conversion. I prefer to always be specific about timezones, at least
by using something like

Sys.setenv( TZ = "Etc/GMT+8" )

which does not respect daylight savings.

Regarding using character data for identifying the month, in order to
have clean plots of the data I prefer to use the trunc function but it
returns a POSIXlt so I convert it to POSIXct:

discharge$sampmonthbegin <- as.POSIXct( trunc( discharge$sampdate,
    units = "months" ) )

Then any of various ways can be used to aggregate the records by that
column.

On September 2, 2021 12:10:15 PM PDT, Andrew Simmons wrote:
>You could use 'split' to create a list of data frames, and then apply a
>function to each to get the means and sds.
>
>cols <- "cfs" # add more as necessary
>S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
>means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
>sds <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm =
>TRUE)))
>
>On Thu, Sep 2, 2021 at 3:01 PM Rich Shepard wrote:
>
>> On Thu, 2 Sep 2021, Rich Shepard wrote:
>>
>> > If I correctly understand the output of as.POSIXlt each date and time
>> > element is separate, so input such as 2016-03-03 12:00 would now be
>> > 2016 03 03 12 00 (I've not read how the elements are separated). (The
>> > TZ is not important because all data are either PST or PDT.)
>>
>> Using this script:
>>
>> discharge <- read.csv('../data/water/discharge.dat', header = TRUE,
>>                       sep = ',', stringsAsFactors = FALSE)
>> discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz = "",
>>                                  format = '%Y-%m-%d %H:%M',
>>                                  optional = 'logical')
>> discharge$cfs <- as.numeric(discharge$cfs, length = 6)
>>
>> I get this result:
>>
>> > head(discharge)
>>              sampdate    cfs
>> 1 2016-03-03 12:00:00 149000
>> 2 2016-03-03 12:10:00     15
>> 3 2016-03-03 12:20:00 151000
>> 4 2016-03-03 12:30:00 156000
>> 5 2016-03-03 12:40:00 154000
>> 6 2016-03-03 12:50:00     15
>>
>> I'm completely open to suggestions on using this output to calculate
>> monthly means and sds.
>>
>> If dplyr:summarize() will do so please show me how to modify this
>> command:
>>
>> disc_monthly <- ( discharge
>>                   %>% group_by(sampdate)
>>                   %>% summarize(exp_value = mean(cfs, na.rm = TRUE)) )
>>
>> because it produces daily means, not monthly means.
>>
>> TIA,
>>
>> Rich

--
Sent from my phone. Please excuse my brevity.
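Jeff's month-truncation idea from above can be run end-to-end. A hedged sketch on a tiny invented data frame (the column names match the thread, but the values are made up for illustration):

```r
# Invented sample standing in for discharge.dat; the real data have
# 5-minute (or 10-minute) spacing over several years.
discharge <- data.frame(
  sampdate = as.POSIXct(c("2016-03-03 12:00", "2016-03-03 12:10",
                          "2016-04-01 08:00"), tz = "Etc/GMT+8"),
  cfs = c(149000, 150000, 120000)
)

# trunc() to the first instant of the month returns POSIXlt, so convert
# back to POSIXct for a well-behaved data frame column.
discharge$sampmonthbegin <- as.POSIXct(trunc(discharge$sampdate,
                                             units = "months"))

# Any grouped summary now works against that column.
aggregate(cfs ~ sampmonthbegin, data = discharge, FUN = mean)
```

The advantage over a character "%Y-%m" label is that `sampmonthbegin` stays a real time value, so it plots cleanly on a time axis.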
Re: [R] Calculate daily means from 5-minute interval data
On Thu, 2 Sep 2021, Andrew Simmons wrote:

> You could use 'split' to create a list of data frames, and then apply a
> function to each to get the means and sds.
>
> cols <- "cfs" # add more as necessary
> S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
> means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
> sds <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm = TRUE)))

Andrew,

Thank you for the valuable lesson. This is new to me and I know I'll have
use for it in the future, too. Much appreciated!

Stay well,

Rich
Re: [R] Calculate daily means from 5-minute interval data
You could use 'split' to create a list of data frames, and then apply a
function to each to get the means and sds.

cols <- "cfs" # add more as necessary
S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
sds <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm = TRUE)))

On Thu, Sep 2, 2021 at 3:01 PM Rich Shepard wrote:

> On Thu, 2 Sep 2021, Rich Shepard wrote:
>
> > If I correctly understand the output of as.POSIXlt each date and time
> > element is separate, so input such as 2016-03-03 12:00 would now be
> > 2016 03 03 12 00 (I've not read how the elements are separated). (The
> > TZ is not important because all data are either PST or PDT.)
>
> Using this script:
>
> discharge <- read.csv('../data/water/discharge.dat', header = TRUE,
>                       sep = ',', stringsAsFactors = FALSE)
> discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz = "",
>                                  format = '%Y-%m-%d %H:%M',
>                                  optional = 'logical')
> discharge$cfs <- as.numeric(discharge$cfs, length = 6)
>
> I get this result:
>
> > head(discharge)
>              sampdate    cfs
> 1 2016-03-03 12:00:00 149000
> 2 2016-03-03 12:10:00     15
> 3 2016-03-03 12:20:00 151000
> 4 2016-03-03 12:30:00 156000
> 5 2016-03-03 12:40:00 154000
> 6 2016-03-03 12:50:00     15
>
> I'm completely open to suggestions on using this output to calculate
> monthly means and sds.
>
> If dplyr:summarize() will do so please show me how to modify this
> command:
>
> disc_monthly <- ( discharge
>                   %>% group_by(sampdate)
>                   %>% summarize(exp_value = mean(cfs, na.rm = TRUE)) )
>
> because it produces daily means, not monthly means.
>
> TIA,
>
> Rich
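Run on a small invented data frame, the split-and-rbind recipe above produces one row per month in both matrices (the sample values here are made up; the real input is the discharge.dat file):

```r
# Invented stand-in for the real discharge data frame.
discharge <- data.frame(
  sampdate = as.POSIXct(c("2016-03-03 12:00", "2016-03-15 06:00",
                          "2016-04-02 09:00", "2016-04-20 18:00"),
                        tz = "Etc/GMT+8"),
  cfs = c(149000, 151000, 120000, 118000)
)

cols <- "cfs"  # add more column names as necessary
# split() partitions the rows by the "%Y-%m" label of each timestamp.
S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
sds   <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm = TRUE)))
means  # one row per month, named "2016-03", "2016-04", ...
sds
```

Because `S` is a list of data frames, any per-month summary (quantiles, counts, minima) drops into the same `lapply` slot.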
Re: [R] Calculate daily means from 5-minute interval data
On Thu, 2 Sep 2021, Rich Shepard wrote:

> If I correctly understand the output of as.POSIXlt each date and time
> element is separate, so input such as 2016-03-03 12:00 would now be 2016
> 03 03 12 00 (I've not read how the elements are separated). (The TZ is
> not important because all data are either PST or PDT.)

Using this script:

discharge <- read.csv('../data/water/discharge.dat', header = TRUE,
                      sep = ',', stringsAsFactors = FALSE)
discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz = "",
                                 format = '%Y-%m-%d %H:%M',
                                 optional = 'logical')
discharge$cfs <- as.numeric(discharge$cfs, length = 6)

I get this result:

> head(discharge)
             sampdate    cfs
1 2016-03-03 12:00:00 149000
2 2016-03-03 12:10:00     15
3 2016-03-03 12:20:00 151000
4 2016-03-03 12:30:00 156000
5 2016-03-03 12:40:00 154000
6 2016-03-03 12:50:00     15

I'm completely open to suggestions on using this output to calculate
monthly means and sds.

If dplyr:summarize() will do so please show me how to modify this command:

disc_monthly <- ( discharge
                  %>% group_by(sampdate)
                  %>% summarize(exp_value = mean(cfs, na.rm = TRUE)) )

because it produces daily means, not monthly means.

TIA,

Rich
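One way to adapt the dplyr pipeline above so it summarizes by month rather than by day is to group on a year-month label instead of the raw sampdate. A hedged sketch (the `sampmonth` name is invented; this also assumes sampdate has been converted to POSIXct rather than POSIXlt, which behaves better inside data frames):

```r
library(dplyr)

# Grouping on a "%Y-%m" label yields one summary row per calendar month
# instead of one per distinct timestamp.
disc_monthly <- discharge %>%
  mutate(sampmonth = format(sampdate, "%Y-%m")) %>%
  group_by(sampmonth) %>%
  summarize(exp_value = mean(cfs, na.rm = TRUE),
            sd_value  = sd(cfs, na.rm = TRUE))
```

The original pipeline grouped on the full timestamp, so every distinct sample time became its own group; coarsening the grouping column is the whole fix.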
Re: [R] Calculate daily means from 5-minute interval data
On Mon, 30 Aug 2021, Richard O'Keefe wrote:

> x <- rnorm(samples.per.day * 365)
> length(x)
> [1] 105120
>
> Reshape the fake data into a matrix where each row represents one
> 24-hour period.
>
> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)

Richard,

Now I understand the need to keep the date and time as a single datetime
column; separately, dplyr's summarize() provides daily means (too many
data points to plot over 3-5 years). I reformatted the data to provide a
sampledatetime column and a values column.

If I correctly understand the output of as.POSIXlt each date and time
element is separate, so input such as 2016-03-03 12:00 would now be 2016
03 03 12 00 (I've not read how the elements are separated). (The TZ is
not important because all data are either PST or PDT.)

> Now we can summarise the rows any way we want. The basic tool here is
> ?apply. ?rowMeans is said to be faster than using apply to calculate
> means, so we'll use that. There is no *rowSds so we have to use apply
> for the standard deviation. I use ?head because I don't want to post
> tens of thousands of meaningless numbers.

If I create a matrix using the above syntax the resulting rows contain
all recorded values for a specific day. What would be the syntax to
collect all values for each month? This would result in 12 rows per year;
the periods of record for the five variables available from that gauge
station vary in length.

Regards,

Rich
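Months contain unequal numbers of samples, so the fixed-width matrix trick does not extend directly to monthly rows; a list built with split() is the natural replacement. A sketch on fake data, mirroring Richard's rnorm example (the variable names are invented):

```r
# Fake regularly-sampled data, as in Richard's example: one value
# every 5 minutes for a 365-day year.
samples.per.day <- 12 * 24
when <- seq(as.POSIXct("2016-01-01 00:00", tz = "Etc/GMT+8"),
            by = "5 min", length.out = samples.per.day * 365)
x <- rnorm(samples.per.day * 365)

# One list element per month; elements may differ in length, which is
# exactly what a matrix cannot represent.
by.month <- split(x, format(when, "%Y-%m"))
sapply(by.month, mean)  # 12 monthly means
sapply(by.month, sd)    # 12 monthly standard deviations
```

The same `split()` call works unchanged on multi-year records: the "%Y-%m" labels simply produce more than 12 groups.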
Re: [R] Calculate daily means from 5-minute interval data [RESOLVED]
On Tue, 31 Aug 2021, Jeff Newmiller wrote:

> Never use stringsAsFactors on uncleaned data. For one thing you give a
> factor to as.Date and it tries to make sense of the integer
> representation, not the character representation.

Jeff,

Oops! I had changed it in a previous version of the script and forgot to
change it back again. Fixed:

dtad <- ( dta
          %>% group_by( sampdate )
          %>% summarise( exp_value = mean(cfs, na.rm = TRUE),
                         Count = n() ) )

Thank you. Now I understand how to use dplyr's summarize().

Best regards,

Rich
Re: [R] Calculate daily means from 5-minute interval data
On Wed, 1 Sep 2021, Richard O'Keefe wrote:

> You have missed the point. The issue is not the temporal distance, but
> the fact that the data you have are NOT the raw instrumental data and
> are NOT subject to the limitations of the recording instruments. The
> data you get from the USGS is not the raw instrumental value, and there
> is no longer any good reason for there to be any gaps in it. Indeed, the
> Rogue River data I looked at explicitly includes some flows labelled
> "Ae" meaning that they are NOT the instrumental data at all, but
> estimated.

Richard,

Thanks for your comments.

Regards,

Rich
Re: [R] Calculate daily means from 5-minute interval data
I wrote:

>> By the time you get the data from the USGS, you are already far past
>> the point where what the instruments can write is important.

Rich Shepard replied:

> The data are important because they show what's happened in that period
> of record. Don't physicians take a medical history from patients even
> though those data are far past the point they occurred?

You have missed the point. The issue is not the temporal distance, but the
fact that the data you have are NOT the raw instrumental data and are NOT
subject to the limitations of the recording instruments. The data you get
from the USGS is not the raw instrumental value, and there is no longer
any good reason for there to be any gaps in it. Indeed, the Rogue River
data I looked at explicitly includes some flows labelled "Ae" meaning that
they are NOT the instrumental data at all, but estimated.

> And I use emacs to replace the space between columns with commas so the
> date and the time are separate.

There does not seem to be any good reason for this. As I demonstrated, it
is easy to convert these timestamps to POSIXct form, which is good for
calculating with. If you want to extract year, month, day, and so on, by
far the easiest way is to convert to POSIXlt form (so keeping the
timestamp as a single field) and then use $ to extract the field.

> n <- as.POSIXlt("2003.04.05 06:07", format="%Y.%m.%d %H:%M", tz="UTC")
> n
[1] "2003-04-05 06:07:00 UTC"
> c(n$year+1900, n$mon+1, n$mday, n$hour, n$min)
[1] 2003    4    5    6    7

>> The flow is dominated by a series of "bursts" with a fast onset to a
>> peak and a slow decay, coming in a range of sizes from quite small to
>> rather large, separated by gaps of 4 to 45 days.
>
> And when discharge is controlled by flows through a hydroelectric dam
> there is a lot of variability. The pattern is important to fish as well
> as regulators.

And what is important to fish is NOT captured by daily means and standard
deviations.
For what it's worth, my understanding is that most of the dams on the Rogue River have been removed, leaving only the Lost Creek Lake one, and that this has been good for the fish.

Suppose you have a day when there are 16 hours with no water at all flowing, then 8 hours with 12 cumecs because a dam upstream is discharging. Then the daily mean is 4 cumecs, which might look good for fish, but it wasn't. "Number of minutes below minimum safe level" might be more interesting for the fish.

From the data we have alone, we cannot tell which bursts are due to releases from dams and which have other causes. Dam releases are under human control; storms are not.

Looking at the Rogue River data, plotting daily means
- lowers the peaks
- moves them right
- changes the overall shape

Not severely, mind you, but enough to avoid if you don't have to.

By the way, by far the easiest way to do day-wise summaries, if you really feel you must, is to start with a POSIXct or POSIXlt column, let's call it r$when, then

d <- trunc(difftime(r$when, min(r$when), units="days")) + 1
m <- aggregate(r$flow, by=list(d), FUN=mean)
plot(m, type="l")

You can plug in other summary functions, not just mean. Remember: for all calculations involving dates and times, prefer using the built-in date and time classes to hacking around the problem. aggregate() is a good way to compute oddball summaries.

> > - how do I *detect* these bursts? (detecting a peak isn't too hard,
> >   but the peak is not the onset)
> > - how do I *characterise* these bursts?
> >   (and is the onset rate related to the peak size?)
> > - what's left after taking the bursts out?
> > - can I relate these bursts to something going on upstream?

> Well, those questions could be appropriate depending on what questions you
> need the data to answer.
>
> Environmental data are quite different from experimental, economic,
> financial, and public data (e.g., unemployment, housing costs).
> There are always multiple ways to address an analytical need. Thank you for
> your contributions.
>
> Stay well,
>
> Rich

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
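The day-wise recipe above (a day index from difftime(), then aggregate()) can be sketched end to end. This is a self-contained version on made-up 5-minute data; the column names r$when and r$flow follow the post, and floor(as.numeric(...)) is used for the day index so the arithmetic is plainly numeric:

```r
# Hedged sketch: daily means from fake 5-minute flow data.
set.seed(42)
r <- data.frame(
  when = seq(as.POSIXct("2020-08-26 00:00", tz = "UTC"),
             by = "5 min", length.out = 3 * 288),  # three full days
  flow = rnorm(3 * 288, mean = 200, sd = 20)
)
# Day number 1, 2, 3, ... counted from the first timestamp.
d <- floor(as.numeric(difftime(r$when, min(r$when), units = "days"))) + 1
# One mean flow per day; swap in sd, median, etc. for other summaries.
m <- aggregate(r$flow, by = list(day = d), FUN = mean)
```

plot(m, type = "l") then charts the daily means against day number, as in the post.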
Re: [R] Calculate daily means from 5-minute interval data
Never use stringsAsFactors on uncleaned data. For one thing, you give a factor to as.Date and it tries to make sense of the integer representation, not the character representation.

library(dplyr)
dta <- read.csv(text = "sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,12
", stringsAsFactors = FALSE)
dtad <- ( dta
        %>% group_by( sampdate )
        %>% summarise( exp_value = mean(cfs, na.rm = TRUE)
                     , Count = n()
                     )
        )

On August 31, 2021 2:11:05 PM PDT, Rich Shepard wrote:
>On Sun, 29 Aug 2021, Jeff Newmiller wrote:
>
>> The general idea is to create a "grouping" column with repeated values for
>> each day, and then to use aggregate to compute your combined results. The
>> dplyr package's group_by/summarise functions can also do this, and there
>> are also proponents of the data.table package, which is high performance
>> but tends to depend on altering data in-place, unlike most other R data
>> handling functions.
>
>Jeff,
>
>I've read a number of docs discussing dplyr's summarize and group_by
>functions (including that section of Hadley's 'R for Data Science' book),
>yet I'm missing something; I think that I need to separate the single
>sampdate column into columns for year, month, and day and group_by
>year/month, summarizing within those groups.
>
>The data are of this format:
>
>sampdate,samptime,cfs
>2020-08-26,09:30,136000
>2020-08-26,09:35,126000
>2020-08-26,09:40,13
>2020-08-26,09:45,128000
>2020-08-26,09:50,126000
>2020-08-26,09:55,125000
>2020-08-26,10:00,121000
>2020-08-26,10:05,117000
>2020-08-26,10:10,12
>
>My current script is:
>
>---8<--
>library('tidyverse')
>
>discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',',
>                        stringsAsFactors = TRUE)
>discharge$sampdate <- as.Date(discharge$sampdate)
>discharge$cfs <- as.numeric(discharge$cfs, length = 6)
>
># use dplyr.summarize grouped by date
>
># need to separate sampdate into %Y-%M-%D in order to group_by the month?
>by_month <- discharge %>%
>  group_by(sampdate ...
>summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs))
>>8
>
>and the results are:
>
>> str(discharge)
>'data.frame': 93254 obs. of 3 variables:
> $ sampdate: Date, format: "2020-08-26" "2020-08-26" ...
> $ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123
>   128 133 138 143 148 ...
> $ cfs     : num 176 156 165 161 156 154 144 137 142 142 ...
>> ls()
>[1] "by_month" "discharge"
>> by_month
># A tibble: 93,254 × 3
># Groups: sampdate [322]
>   sampdate   samptime   cfs
> 1 2020-08-26 09:30      176
> 2 2020-08-26 09:35      156
> 3 2020-08-26 09:40      165
> 4 2020-08-26 09:45      161
> 5 2020-08-26 09:50      156
> 6 2020-08-26 09:55      154
> 7 2020-08-26 10:00      144
> 8 2020-08-26 10:05      137
> 9 2020-08-26 10:10      142
>10 2020-08-26 10:15      142
># … with 93,244 more rows
>
>I don't know why the discharge values are truncated to 3 digits when they're
>6 digits in the input data.
>
>Suggested readings appreciated,
>
>Rich

-- 
Sent from my phone. Please excuse my brevity.
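The factor pitfall Jeff describes is also the answer to Rich's "truncated to 3 digits" puzzle: with stringsAsFactors = TRUE, cfs becomes a factor, and coercing a factor with as.numeric() returns the underlying integer level codes, not the printed values. A small demo (the three values are a made-up fragment of the posted data):

```r
# as.numeric() on a factor yields the level codes, not the values.
f <- factor(c("136000", "126000", "128000"))
as.numeric(f)                 # level codes (levels sort as strings)
as.numeric(as.character(f))   # the actual numbers: go via character
```

This is why the cfs column printed as small integers like 176 and 156: those are codes into the 728-odd factor levels, not discharges.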
Re: [R] Calculate daily means from 5-minute interval data
On Sun, 29 Aug 2021, Jeff Newmiller wrote:

> The general idea is to create a "grouping" column with repeated values for
> each day, and then to use aggregate to compute your combined results. The
> dplyr package's group_by/summarise functions can also do this, and there
> are also proponents of the data.table package, which is high performance
> but tends to depend on altering data in-place, unlike most other R data
> handling functions.

Jeff,

I've read a number of docs discussing dplyr's summarize and group_by functions (including that section of Hadley's 'R for Data Science' book), yet I'm missing something; I think that I need to separate the single sampdate column into columns for year, month, and day and group_by year/month, summarizing within those groups.

The data are of this format:

sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,12

My current script is:

---8<--
library('tidyverse')

discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',',
                        stringsAsFactors = TRUE)
discharge$sampdate <- as.Date(discharge$sampdate)
discharge$cfs <- as.numeric(discharge$cfs, length = 6)

# use dplyr.summarize grouped by date

# need to separate sampdate into %Y-%M-%D in order to group_by the month?
by_month <- discharge %>%
  group_by(sampdate ...
summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs))
>8

and the results are:

> str(discharge)
'data.frame': 93254 obs. of 3 variables:
 $ sampdate: Date, format: "2020-08-26" "2020-08-26" ...
 $ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123 128
   133 138 143 148 ...
 $ cfs     : num 176 156 165 161 156 154 144 137 142 142 ...
> ls()
[1] "by_month" "discharge"
> by_month
# A tibble: 93,254 × 3
# Groups: sampdate [322]
   sampdate   samptime   cfs
 1 2020-08-26 09:30      176
 2 2020-08-26 09:35      156
 3 2020-08-26 09:40      165
 4 2020-08-26 09:45      161
 5 2020-08-26 09:50      156
 6 2020-08-26 09:55      154
 7 2020-08-26 10:00      144
 8 2020-08-26 10:05      137
 9 2020-08-26 10:10      142
10 2020-08-26 10:15      142
# … with 93,244 more rows

I don't know why the discharge values are truncated to 3 digits when they're 6 digits in the input data.

Suggested readings appreciated,

Rich
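The month-wise grouping asked about here does not require splitting sampdate into separate columns: a year-month key derived with format() can be grouped on directly. A hedged base-R sketch (aggregate() rather than dplyr, so it is self-contained; the data are a made-up three-row fragment in the same shape as the post's):

```r
# Group 5-minute samples by year-month via a derived key column.
discharge <- read.csv(text = "sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-09-01,10:00,121000
", stringsAsFactors = FALSE)
discharge$yearmon <- format(as.Date(discharge$sampdate), "%Y-%m")
by_month <- aggregate(cfs ~ yearmon, data = discharge, FUN = mean)
by_month   # one mean discharge per year-month
```

The dplyr equivalent would be mutate(yearmon = format(sampdate, "%Y-%m")) followed by group_by(yearmon) and summarise().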
Re: [R] Calculate daily means from 5-minute interval data
On Tue, 31 Aug 2021, Richard O'Keefe wrote:

> By the time you get the data from the USGS, you are already far past the
> point where what the instruments can write is important.

Richard,

The data are important because they show what's happened in that period of record. Don't physicians take a medical history from patients even though those data are far past the point they occurred?

> agency_cd  site_no  datetime  tz_cd  71932_00060  71932_00060_cd
> 5s         15s      20d       6s     14n          10s
>
> (I do not know what the last line signifies.)

The numbers represent the space for each fixed-width field.

> After using read.delim to read the file I note that the timestamps are in a
> single column, formatted like "2020-08-30 00:15", matching the pattern
> "%Y-%m-%d %H:%M". After reading the data into R and using
>
> r$datetime <- as.POSIXct(r$datetime, format="%Y-%m-%d %H:%M", tz=r$tz_cd)

And I use emacs to replace the space between columns with commas so the date and the time are separate.

> So for this data set, spanning one year, all the times are in the same time
> zone, observations are 15 minutes apart, not 5, and there are no missing
> data. This was obviously the wrong data set.

As I provided when I first asked for suggestions:

sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000

The recorded values are 5 minutes apart. That data set is immaterial for my project but perfect when one needs data from that gauge station on the Rogue River.

> The flow is dominated by a series of "bursts" with a fast onset to a peak
> and a slow decay, coming in a range of sizes from quite small to rather
> large, separated by gaps of 4 to 45 days.

And when discharge is controlled by flows through a hydroelectric dam there is a lot of variability. The pattern is important to fish as well as regulators.

> I'd be looking at
> - how do I *detect* these bursts?
>   (detecting a peak isn't too hard, but the peak is not the onset)
> - how do I *characterise* these bursts?
>   (and is the onset rate related to the peak size?)
> - what's left after taking the bursts out?
> - can I relate these bursts to something going on upstream?

Well, those questions could be appropriate depending on what questions you need the data to answer.

Environmental data are quite different from experimental, economic, financial, and public data (e.g., unemployment, housing costs).

There are always multiple ways to address an analytical need. Thank you for your contributions.

Stay well,

Rich
Re: [R] Calculate daily means from 5-minute interval data
By the time you get the data from the USGS, you are already far past the point where what the instruments can write is important. (Obviously an instrument can be sufficiently broken that it cannot write anything.) The data for Rogue River that I just downloaded include this comment:

# Data for the following 1 site(s) are contained in this file
#    USGS 04118500 ROGUE RIVER NEAR ROCKFORD, MI
# ---
#
# Data provided for site 04118500
#    TS      parameter   Description
#    71932   00060       Discharge, cubic feet per second
#
# Data-value qualification codes included in this output:
#    A   Approved for publication -- Processing and review completed.
#    P   Provisional data subject to revision.
#    e   Value has been estimated.
#
agency_cd  site_no  datetime  tz_cd  71932_00060  71932_00060_cd
5s         15s      20d       6s     14n          10s

(I do not know what the last line signifies.) It is, I think, sufficiently clear that the instrument does not know what the qualification code is!

After using read.delim to read the file I note that the timestamps are in a single column, formatted like "2020-08-30 00:15", matching the pattern "%Y-%m-%d %H:%M". After reading the data into R and using

r$datetime <- as.POSIXct(r$datetime, format="%Y-%m-%d %H:%M", tz=r$tz_cd)

I get

     agency         site              datetime                     tz
 USGS:33550   Min.   :4118500   Min.   :2020-08-30 00:00:00   EST:33550
              1st Qu.:4118500   1st Qu.:2020-11-25 13:33:45
              Median :4118500   Median :2021-03-08 03:52:30
              Mean   :4118500   Mean   :2021-03-01 07:05:54
              3rd Qu.:4118500   3rd Qu.:2021-06-03 12:41:15
              Max.   :4118500   Max.   :2021-08-30 22:00:00

      flow            qual
 Min.   : 96.5   A  :18052
 1st Qu.:156.0   A:e:  757
 Median :193.0   P  :14741
 Mean   :212.5
 3rd Qu.:237.0
 Max.   :767.0

So for this data set, spanning one year, all the times are in the same time zone, observations are 15 minutes apart, not 5, and there are no missing data. This was obviously the wrong data set.
Oh well, picking an epoch such as

> epoch <- min(r$datetime)

and then calculating

as.numeric(difftime(timestamp, epoch, units="mins"))

will give you a minute count from which determining day number and bucket within day is trivial arithmetic.

I have attached a plot of the Rogue River flows which should make it very clear what I mean by saying that means and standard deviations are not a good way to characterise this kind of data. The flow is dominated by a series of "bursts" with a fast onset to a peak and a slow decay, coming in a range of sizes from quite small to rather large, separated by gaps of 4 to 45 days. I'd be looking at
- how do I *detect* these bursts?
  (detecting a peak isn't too hard, but the peak is not the onset)
- how do I *characterise* these bursts?
  (and is the onset rate related to the peak size?)
- what's left after taking the bursts out?
- can I relate these bursts to something going on upstream?

My usual recommendation is to start with things available in R out of the box in order to reduce learning time.

On Tue, 31 Aug 2021 at 11:34, Rich Shepard wrote:
>
> On Tue, 31 Aug 2021, Richard O'Keefe wrote:
>
> > I made up fake data in order to avoid showing untested code. It's not part
> > of the process I was recommending. I expect data recorded every N minutes
> > to use NA when something is missing, not to simply not be recorded. Well
> > and good, all that means is that reshaping the data is not a trivial call
> > to matrix(). It does not mean that any additional package is needed or
> > appropriate and it does not affect the rest of the process.
>
> Richard,
>
> The instruments in the gauge pipe don't know to write NA when they're not
> measuring. :-) The outage period varies greatly by location, constituent
> measured, and other unknown factors.
>
> > You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
> > the time stamps are in universal time or in local time?
>
> The data values are not timestamps.
> There's one column for date, a second
> column for time, and a third column for time zone (P in the case of the
> west coast).
>
> > Above all, it doesn't affect the point that you probably should not
> > be doing any of this.
>
> ? (Doesn't require an explanation.)
>
> Rich

Rogue River.pdf
Description: Adobe PDF document
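The epoch/minute-count arithmetic described above can be sketched on made-up 15-minute timestamps; integer division then recovers the day number and the slot within the day:

```r
# Minute count from an epoch, then day number and 15-minute bucket.
ts    <- seq(as.POSIXct("2020-08-30 00:00", tz = "UTC"),
             by = "15 min", length.out = 200)
epoch <- min(ts)
mins  <- as.numeric(difftime(ts, epoch, units = "mins"))
day    <- mins %/% (24 * 60)           # 0-based day number
bucket <- (mins %% (24 * 60)) %/% 15   # 15-minute slot within the day
```

The same arithmetic with %/% 5 would bucket 5-minute data.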
Re: [R] Calculate daily means from 5-minute interval data
I do not wish to express any opinion on what should be done or how. But...

1. I assume that when data are missing, they are missing -- i.e. simply not present in the data. So there will be possibly several/many missing rows of data in succession corresponding to those times, right? (Apologies for being a bit dumb about this, but I always need to check that what I think is blindingly obvious really is.)

2. Do note that when one takes daily averages/sd's/whatever summaries of data that, because of missingness, may be calculated from possibly quite different numbers of data points -- are whole days sometimes missing?? -- then all the summaries (e.g. means) are not created equal: summaries created from more data are more "trustworthy" and should receive "appropriately" greater weight than those created from fewer. Makes sense, right?

So I suspect that this may not be as straightforward as you think -- you may wish to find a local statistician with some experience in these sorts of things to help you deal with them. Up to you, of course.

Cheers,
Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom
County" comic strip)

On Mon, Aug 30, 2021 at 4:34 PM Rich Shepard wrote:
>
> On Tue, 31 Aug 2021, Richard O'Keefe wrote:
>
> > I made up fake data in order to avoid showing untested code. It's not part
> > of the process I was recommending. I expect data recorded every N minutes
> > to use NA when something is missing, not to simply not be recorded. Well
> > and good, all that means is that reshaping the data is not a trivial call
> > to matrix(). It does not mean that any additional package is needed or
> > appropriate and it does not affect the rest of the process.
>
> Richard,
>
> The instruments in the gauge pipe don't know to write NA when they're not
> measuring. :-) The outage period varies greatly by location, constituent
> measured, and other unknown factors.
>
> > You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
> > the time stamps are in universal time or in local time?
>
> The data values are not timestamps. There's one column for date, a second
> column for time, and a third column for time zone (P in the case of the
> west coast).
>
> > Above all, it doesn't affect the point that you probably should not
> > be doing any of this.
>
> ? (Doesn't require an explanation.)
>
> Rich
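Bert's point 2 can be made concrete with a small made-up example: when daily means are built from unequal numbers of samples, an overall average of those means should weight each day by its sample count.

```r
# A day summarised from fewer samples should carry less weight.
counts <- c(288, 96, 288)    # samples per day; day 2 had an outage
daily  <- c(200, 350, 210)   # daily mean flows (made-up numbers)
plain    <- mean(daily)                       # treats all days equally
weighted <- weighted.mean(daily, w = counts)  # weights by sample count
c(plain = plain, weighted = weighted)
```

The gap-riddled day 2 pulls the plain mean up much more than it deserves; the weighted mean discounts it in proportion to how little data it contributed.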
Re: [R] Calculate daily means from 5-minute interval data
Am I seeing an odd aspect to this discussion? There are many ways to solve problems, and some may be favored by some more than others. All require some examination of the data so it can be massaged into shape for the processes that follow.

If you insist on using the matrix method to arrange that each row or column has the data you want, then, yes, you need to guarantee all your data are present and in the right order. If some may be missing, you may want to write a program that generates all possible dates in order and interpolates them back (or into a copy, more likely) so all the missing items are represented and show up as an NA or whatever you want. You may also want to check that all dates are in order with no duplicates, and anything else that makes sense, and then you are free to ask for the vector to be seen as a matrix with N columns or rows.

For many, the cleaner solution is to use constructs that may be more resistant to imperfections or allow them to be treated better. I would probably use tidyverse functionality these days but can easily understand people preferring base R or other packages. I have done similar analyses of real data gathered from streams -- various chemicals and levels taken at various times and depths, including times no measures happened and times there was more than one measure. It is much more robust to use methods like group_by and then apply other verbs to the already-grouped data, especially when the next steps involve making plots with ggplot. It was rather trivial, for example, to replace multiple measures by the average of the measures. And many of my plots are faceted by variables, which is not trivial to do in base R.

I suggest not falling in love with the first way you think of and trying to bend everything to fit.
Yes, some methods may be quite a bit more efficient, but rarely do I run into problems even with quite large collections of data, like a quarter million rows with dozens of columns, including odd columns like the output of some analysis. And note the current set of data may be extended with more over time, or you may get other data collected that would not necessarily work well with a hard-coded method but might easily adjust to a new method.

-----Original Message-----
From: R-help On Behalf Of Rich Shepard
Sent: Monday, August 30, 2021 7:34 PM
To: R Project Help
Subject: Re: [R] Calculate daily means from 5-minute interval data

On Tue, 31 Aug 2021, Richard O'Keefe wrote:

> I made up fake data in order to avoid showing untested code. It's not
> part of the process I was recommending. I expect data recorded every N
> minutes to use NA when something is missing, not to simply not be
> recorded. Well and good, all that means is that reshaping the data is
> not a trivial call to matrix(). It does not mean that any additional
> package is needed or appropriate and it does not affect the rest of the
> process.

Richard,

The instruments in the gauge pipe don't know to write NA when they're not measuring. :-) The outage period varies greatly by location, constituent measured, and other unknown factors.

> You will want the POSIXct class, see ?DateTimeClasses. Do you know
> whether the time stamps are in universal time or in local time?

The data values are not timestamps. There's one column for date, a second column for time, and a third column for time zone (P in the case of the west coast).

> Above all, it doesn't affect the point that you probably should not be
> doing any of this.

? (Doesn't require an explanation.)

Rich
Re: [R] Calculate daily means from 5-minute interval data
On Tue, 31 Aug 2021, Richard O'Keefe wrote:

> I made up fake data in order to avoid showing untested code. It's not part
> of the process I was recommending. I expect data recorded every N minutes
> to use NA when something is missing, not to simply not be recorded. Well
> and good, all that means is that reshaping the data is not a trivial call
> to matrix(). It does not mean that any additional package is needed or
> appropriate and it does not affect the rest of the process.

Richard,

The instruments in the gauge pipe don't know to write NA when they're not measuring. :-) The outage period varies greatly by location, constituent measured, and other unknown factors.

> You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
> the time stamps are in universal time or in local time?

The data values are not timestamps. There's one column for date, a second column for time, and a third column for time zone (P in the case of the west coast).

> Above all, it doesn't affect the point that you probably should not
> be doing any of this.

? (Doesn't require an explanation.)

Rich
Re: [R] Calculate daily means from 5-minute interval data
I made up fake data in order to avoid showing untested code. It's not part of the process I was recommending. I expect data recorded every N minutes to use NA when something is missing, not to simply not be recorded. Well and good; all that means is that reshaping the data is not a trivial call to matrix(). It does not mean that any additional package is needed or appropriate, and it does not affect the rest of the process.

You will want the POSIXct class, see ?DateTimeClasses. Do you know whether the time stamps are in universal time or in local time?

Above all, it doesn't affect the point that you probably should not be doing any of this.

On Tue, 31 Aug 2021 at 00:42, Rich Shepard wrote:
>
> On Mon, 30 Aug 2021, Richard O'Keefe wrote:
>
> > Why would you need a package for this?
> > > samples.per.day <- 12*24
> >
> > That's 12 5-minute intervals per hour and 24 hours per day.
> > Generate some fake data.
>
> Richard,
>
> The problem is that there are days with fewer than 12 recorded values for
> various reasons.
>
> When testing algorithms I use small subsets of actual data rather than fake
> data.
>
> Thanks for your detailed procedure.
>
> Regards,
>
> Rich
Re: [R] Calculate daily means from 5-minute interval data
On Mon, 30 Aug 2021, Richard O'Keefe wrote:

> Why would you need a package for this?
> > samples.per.day <- 12*24
>
> That's 12 5-minute intervals per hour and 24 hours per day.
> Generate some fake data.

Richard,

The problem is that there are days with fewer than 12 recorded values for various reasons.

When testing algorithms I use small subsets of actual data rather than fake data.

Thanks for your detailed procedure.

Regards,

Rich
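The days-with-missing-samples problem Rich raises is exactly where grouping beats matrix() reshaping: days with missing rows simply form smaller groups. A sketch on fake two-day 5-minute data with a simulated outage on day 2:

```r
# Grouping is robust to unequal group sizes; matrix() is not.
ts   <- seq(as.POSIXct("2020-08-26 00:00", tz = "UTC"),
            by = "5 min", length.out = 2 * 288)
flow <- rep(c(100, 200), each = 288)       # constant flow per day
keep <- setdiff(seq_along(ts), 300:350)    # drop 51 rows from day 2
r <- data.frame(when = ts[keep], flow = flow[keep])
r$day <- as.Date(r$when)
daily_mean <- aggregate(flow ~ day, data = r, FUN = mean)
daily_n    <- aggregate(flow ~ day, data = r, FUN = length)
```

daily_n also gives the per-day sample counts needed for the weighting Bert Gunter describes elsewhere in the thread.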
Re: [R] Calculate daily means from 5-minute interval data
It is not clear to me who Jeff Newmiller's comment about periodicity is addressed to. The original poster, for asking for daily summaries?

A summary of what I wrote:
- daily means and standard deviations are a very poor choice for river flow data
- if you insist on doing that anyway, no fancy packages are required: just reshape the data into a matrix where rows correspond to days using matrix() and summarise it using rowMeans() and apply(..., FUN=sd)
- but it is quite revealing to just plot the data using image(), which makes no assumptions about periodicity or anything else; it is just a way of wrapping 1D data to fill a 2D space and still have interpretable axes
- the river data I examined showed fairly boring time series interrupted by substantial shocks (caused by rainfall in catchment areas)

New stuff...

The river data I looked at came from Environment Canterbury. River flows there are driven by (a) snow-melt from the Southern Alps, which *is* roughly periodic with a period of one year, and (b) rainfall events which charge the upstream catchment areas, leading to a rapid ramp up followed by a slower exponential-looking decay. The (a) element happens to be invisible in the Environment Canterbury data, as they only release the latest month of flow data. The ratio between low flows and high flows ranged from 2 to 10 in the data I could get. The (b) component is NOT periodic and is NOT aligned with days and is NOT predictable and is extremely important.

Where you begin is not with R or a search for packages but with the question "what is actually going on in the real world? What are the influences on river flow, are they natural (and which) or human (and which)?" It's going to matter a lot how much irrigation water is drawn from a river, and that may be roughly predictable. If water is occasionally diverted into another river for flood control, that's going to make a difference. If there is a dam, that's going to make a difference.
Rainfall and snowmelt are going to be seasonal (in a hand-wavy sense) but differently so. And there is an equally important question: "Why am I doing this? What do I want to see in the data that doesn't already leap to the eye? What is anyone going to DO differently if they see that?" Are you interested in whether minimum flows are adequate for irrigation or whether flood control systems are adequate for high flows? Thinking about the people who might read my report, if I were tasked with analysing river data, I would want to analyse the data and present the results in such a way that most of them would say "Why did I need this guy? It's so obvious! I could have done that! (If I had ever thought of it.)" But that is because I am thinking of farmers and politicians who have other maddened grizzly bears to stun (thanks, Terry Pratchett). If writing for an audience of hydrologists and statisticians, you would make different choices. Here's a little bit of insight from the physics. Why is it that the spikes in the flows rise rapidly and fall slowly? Because the fall is limited by the rate at which the river system can carry water away, but the rate at which a storm can deliver water to the river system is not. Did I know this before looking at the ECan data? Well, I had *seen* rivers rising rapidly and falling slowly, but I had never *observed*; I had never thought about it. But now that I have, it's *obvious*: you cannot understand the river without understanding the weather that the river is subject to. Anyone who genuinely understands hydrology is looking at me sadly and saying "Just now you figured this out? At your mother's knee you didn't learn this?" But it has such repercussions. It means you need data on rainfall in the catchment areas. (Which ECan, to their credit, also provide.) In an important sense, there is no right way to analyse river flow data *on its own*. 
On Mon, 30 Aug 2021 at 14:47, Jeff Newmiller wrote:
>
> IMO assuming periodicity is a bad practice for this. Missing timestamps
> happen too, and there is no reason to build a broken analysis process.
>
> On August 29, 2021 7:09:01 PM PDT, Richard O'Keefe wrote:
> >Why would you need a package for this?
> >> samples.per.day <- 12*24
> >
> >That's 12 5-minute intervals per hour and 24 hours per day.
> >Generate some fake data.
> >
> >> x <- rnorm(samples.per.day * 365)
> >> length(x)
> >[1] 105120
> >
> >Reshape the fake data into a matrix where each row represents one
> >24-hour period.
> >
> >> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)
> >
> >Now we can summarise the rows any way we want.
> >The basic tool here is ?apply.
> >?rowMeans is said to be faster than using apply to calculate means,
> >so we'll use that. There is no *rowSds so we have to use apply
> >for the standard deviation. I use ?head because I don't want to
> >post tens of thousands of meaningless numbers.
> >
> >> head(rowMeans(m))
> >[1] -0.03510177 0.11817337 0.06725203 -0.03578195 -0.02448077
Re: [R] Calculate daily means from 5-minute interval data
IMO assuming periodicity is a bad practice for this. Missing timestamps happen too, and there is no reason to build a broken analysis process.

On August 29, 2021 7:09:01 PM PDT, Richard O'Keefe wrote:
>Why would you need a package for this?
>> samples.per.day <- 12*24
>
>That's 12 5-minute intervals per hour and 24 hours per day.
>Generate some fake data.
>
>> x <- rnorm(samples.per.day * 365)
>> length(x)
>[1] 105120
>
>Reshape the fake data into a matrix where each row represents one
>24-hour period.
>
>> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)
>
>Now we can summarise the rows any way we want.
>The basic tool here is ?apply.
>?rowMeans is said to be faster than using apply to calculate means,
>so we'll use that. There is no *rowSds so we have to use apply
>for the standard deviation. I use ?head because I don't want to
>post tens of thousands of meaningless numbers.
>
>> head(rowMeans(m))
>[1] -0.03510177  0.11817337  0.06725203 -0.03578195 -0.02448077 -0.03033692
>> head(apply(m, MARGIN=1, FUN=sd))
>[1] 1.0017718 0.9922920 1.0100550 0.9956810 1.0077477 0.9833144
>
>Now whether this is a *sensible* way to summarise your flow data is a
>question that a hydrologist would be better placed to answer. I would
>have started with
>> plot(density(x))
>which I just did with some real river data (only a month of it, sigh).
>Very long tail.
>Even
>> plot(density(log(r)))
>shows a very long tail. Time to plot the data against time. Oh my!
>All of the long tail came from a single event.
>There's a period of low flow, then there's a big rainstorm and the
>flow goes WAY up, then over about two days the flow subsides to a new
>somewhat higher level.
>
>None of this is reflected in means or standard deviations.
>This is *time series* data, and time series data of a fairly special kind.
>
>One thing that might be helpful with your data would simply be
>> image(log(m))
>For my one month sample, the spike showed up very clearly that way.
>Because right now, your first task is to get an idea of what the data
>look like, and means-and-standard-deviations won't really do that.
>
>Oh heck, here's another reason to go with image(log(m)).
>With image(m) I just see the one big spike.
>With image(log(m)), I can see that little spikes often start in the
>afternoon of one day and continue into the morning of the next.
>From daily means, it looks like two unusual, but not very
>unusual, days. From the image, it's clearly ONE rainfall event
>that just happens to straddle a day boundary.
>
>This is all very basic stuff, which is really the point. You want to use
>elementary tools to look at the data before you reach for fancy ones.
>
>On Mon, 30 Aug 2021 at 03:09, Rich Shepard wrote:
>>
>> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
>> from a USGS monitoring gauge recording values every 5 minutes. The data
>> files contain 90K-93K lines and plotting all these data would produce a
>> solid block of color.
>>
>> What I want are the daily means and standard deviation from these data.
>>
>> As an occasional R user (depending on project needs) I've no idea what
>> packages could be applied to these data frames. There likely are multiple
>> paths to extracting these daily values so summary statistics can be
>> calculated and plotted. I'd appreciate suggestions on where to start to
>> learn how I can do this.
>>
>> TIA,
>>
>> Rich

--
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
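Jeff's warning can be made concrete with a short sketch (all data and column names here are invented for illustration): grouping by the recorded date survives a gap in the record, whereas a fixed 288-rows-per-day reshape silently misaligns every day after the gap.

```r
# Two days of 5-minute stamps, with two samples deliberately removed
# to simulate an equipment outage:
stamps <- seq(as.POSIXct("2020-08-26 00:00", tz = "UTC"),
              by = "5 min", length.out = 2 * 288)
stamps <- stamps[-c(10, 11)]
df <- data.frame(when = stamps, cfs = 100000 + seq_along(stamps))

# Grouping by the actual date cannot be fooled by the gap:
df$day <- as.Date(df$when)
counts <- table(df$day)                 # 286 samples, then 288
daily_means <- tapply(df$cfs, df$day, mean)

# By contrast, matrix(df$cfs, ncol = 288, byrow = TRUE) would pull the
# first two samples of day 2 into day 1's row -- the broken process
# that assuming periodicity builds in.
```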
Re: [R] Calculate daily means from 5-minute interval data
Why would you need a package for this?

> samples.per.day <- 12*24

That's 12 5-minute intervals per hour and 24 hours per day.
Generate some fake data.

> x <- rnorm(samples.per.day * 365)
> length(x)
[1] 105120

Reshape the fake data into a matrix where each row represents one
24-hour period.

> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)

Now we can summarise the rows any way we want.
The basic tool here is ?apply.
?rowMeans is said to be faster than using apply to calculate means,
so we'll use that. There is no *rowSds so we have to use apply
for the standard deviation. I use ?head because I don't want to
post tens of thousands of meaningless numbers.

> head(rowMeans(m))
[1] -0.03510177  0.11817337  0.06725203 -0.03578195 -0.02448077 -0.03033692
> head(apply(m, MARGIN=1, FUN=sd))
[1] 1.0017718 0.9922920 1.0100550 0.9956810 1.0077477 0.9833144

Now whether this is a *sensible* way to summarise your flow data is a
question that a hydrologist would be better placed to answer. I would
have started with

> plot(density(x))

which I just did with some real river data (only a month of it, sigh).
Very long tail. Even

> plot(density(log(r)))

shows a very long tail. Time to plot the data against time. Oh my!
All of the long tail came from a single event.
There's a period of low flow, then there's a big rainstorm and the
flow goes WAY up, then over about two days the flow subsides to a new
somewhat higher level.

None of this is reflected in means or standard deviations.
This is *time series* data, and time series data of a fairly special kind.

One thing that might be helpful with your data would simply be

> image(log(m))

For my one month sample, the spike showed up very clearly that way.
Because right now, your first task is to get an idea of what the data
look like, and means-and-standard-deviations won't really do that.

Oh heck, here's another reason to go with image(log(m)).
With image(m) I just see the one big spike.
With image(log(m)), I can see that little spikes often start in the
afternoon of one day and continue into the morning of the next.
From daily means, it looks like two unusual, but not very unusual,
days. From the image, it's clearly ONE rainfall event that just
happens to straddle a day boundary.

This is all very basic stuff, which is really the point. You want to use
elementary tools to look at the data before you reach for fancy ones.

On Mon, 30 Aug 2021 at 03:09, Rich Shepard wrote:
>
> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> from a USGS monitoring gauge recording values every 5 minutes. The data
> files contain 90K-93K lines and plotting all these data would produce a
> solid block of color.
>
> What I want are the daily means and standard deviation from these data.
>
> As an occasional R user (depending on project needs) I've no idea what
> packages could be applied to these data frames. There likely are multiple
> paths to extracting these daily values so summary statistics can be
> calculated and plotted. I'd appreciate suggestions on where to start to
> learn how I can do this.
>
> TIA,
>
> Rich
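A shrunken, runnable version of the reshape idea above (4 samples per day over 3 days, with values chosen so the answer is visible by eye); note that it quietly assumes every day contributes a complete set of samples:

```r
samples.per.day <- 4                          # stand-in for 12*24
x <- rep(c(1, 2, 3), each = samples.per.day)  # flow constant within each day
m <- matrix(x, ncol = samples.per.day, byrow = TRUE)

rowMeans(m)                     # daily means: 1 2 3
apply(m, MARGIN = 1, FUN = sd)  # daily sds:   0 0 0
```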
Re: [R] Calculate daily means from 5-minute interval data
On Sun, 29 Aug 2021, Andrew Simmons wrote:
> I would suggest something like:

Thanks, Andrew.

Stay well,
Rich
Re: [R] Calculate daily means from 5-minute interval data
On Sun, 29 Aug 2021, Rui Barradas wrote:
> Hope this helps,

Rui,

Greatly! I'll study it carefully so I fully understand the process. Many thanks.

Stay well,
Rich
Re: [R] Calculate daily means from 5-minute interval data
Hello,

I forgot in my previous answer, sorry for the duplicated mails.

The function in my previous mail has a na.rm argument, defaulting to FALSE; pass na.rm = TRUE to remove the NA's.

agg <- aggregate(cfs ~ date, df1, fun, na.rm = TRUE)

Or simply change the default. I prefer to set na.rm = FALSE because that's what R's functions do, and this way I only ever have to be used to one default, with base R functions or my own code.

Hope this helps,

Rui Barradas

On 29/08/21 17:52, Rich Shepard wrote:
> On Sun, 29 Aug 2021, Jeff Newmiller wrote:
>> The general idea is to create a "grouping" column with repeated values
>> for each day, and then to use aggregate to compute your combined
>> results. The dplyr package's group_by/summarise functions can also do
>> this, and there are also proponents of the data.table package, which is
>> high performance but tends to depend on altering data in-place, unlike
>> most other R data handling functions.
>>
>> Also pay attention to missing data... if you have any then you will
>> need to consider whether you want the strictness of na.rm=FALSE or the
>> permissiveness of na.rm=TRUE for your aggregation functions.
>
> Jeff,
>
> Thank you. Yes, there are missing data, as sometimes the equipment fails
> or there's some other reason why some samples are missing.
>
> Grouping on each day is just what I need. I'll re-learn dplyr and take a
> look at data.table.
>
> Regards,
> Rich
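One wrinkle worth knowing when trying Rui's na.rm switch: the formula method of aggregate drops NA rows itself (its default is na.action = na.omit), so na.rm only visibly matters if you also pass na.action = na.pass. A toy illustration (the four-row data frame is invented):

```r
df1 <- data.frame(date = as.Date(rep("2020-08-26", 4)),
                  cfs  = c(100, 110, NA, 120))
fun <- function(x, na.rm = FALSE) {
  c(mean_cfs = mean(x, na.rm = na.rm), sd_cfs = sd(x, na.rm = na.rm))
}

# With na.pass the NA reaches fun, so na.rm decides the outcome:
a1 <- aggregate(cfs ~ date, df1, fun, na.action = na.pass)
a2 <- aggregate(cfs ~ date, df1, fun, na.rm = TRUE, na.action = na.pass)

a1$cfs  # NA NA   -- the NA propagates through mean() and sd()
a2$cfs  # 110 10  -- mean and sd of the three good samples
```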
Re: [R] Calculate daily means from 5-minute interval data
On Sun, 29 Aug 2021, Jeff Newmiller wrote:
> You may find something useful on handling timestamp data here:
> https://jdnewmil.github.io/

Jeff,

I'll certainly read those articles.

Many thanks,
Rich
Re: [R] Calculate daily means from 5-minute interval data
Hello,

You have date and hour in two separate columns, so to compute daily stats part of the work is already done. (Were they in the same column you would have to extract the date only.)

# convert to class "Date"
df1$date <- as.Date(df1$date)

# function to compute the stats required;
# it's important to note that all the stats
# are returned in a vector, see below
fun <- function(x, na.rm = FALSE){
  c(mean_cfs = mean(x, na.rm = na.rm),
    sd_cfs = sd(x, na.rm = na.rm))
}

# now this will put a *matrix* under cfs;
# each row has the statistics computed
# by the function
agg <- aggregate(cfs ~ date, df1, fun)
str(agg)
#'data.frame': 1 obs. of 2 variables:
# $ date: Date, format: "2020-08-26"
# $ cfs : num [1, 1:2] 110400 16143
#  ..- attr(*, "dimnames")=List of 2
#  .. ..$ : NULL
#  .. ..$ : chr [1:2] "mean_cfs" "sd_cfs"

# so now put everything in separate columns
agg <- cbind(agg[-ncol(agg)], agg[[ncol(agg)]])
str(agg)
#'data.frame': 1 obs. of 3 variables:
# $ date    : Date, format: "2020-08-26"
# $ mean_cfs: num 110400
# $ sd_cfs  : num 16143

Hope this helps,

Rui Barradas

On 29/08/21 17:49, Rich Shepard wrote:
> On Sun, 29 Aug 2021, Eric Berger wrote:
>> Provide dummy data (e.g. 5-10 lines), say like the contents of a csv
>> file, and calculate by hand what you'd like to see in the plot. (And
>> describe what the plot would look like.)
>
> Eric,
>
> Mea culpa! I extracted a set of sample data and forgot to include it in
> the message. Here it is:
>
> date,time,cfs
> 2020-08-26,09:30,136000
> 2020-08-26,09:35,126000
> 2020-08-26,09:40,13
> 2020-08-26,09:45,128000
> 2020-08-26,09:50,126000
> 2020-08-26,09:55,125000
> 2020-08-26,10:00,121000
> 2020-08-26,10:05,117000
> 2020-08-26,10:10,12
> ...
> 2020-08-26,23:10,108000
> 2020-08-26,23:15,96200
> 2020-08-26,23:20,86700
> 2020-08-26,23:25,103000
> 2020-08-26,23:30,103000
> 2020-08-26,23:35,99500
> 2020-08-26,23:40,85200
> 2020-08-26,23:45,103000
> 2020-08-26,23:50,95800
> 2020-08-26,23:55,88200
>
> Rich
Re: [R] Calculate daily means from 5-minute interval data
Hello,

I would suggest something like:

date <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), 1)
time <- sprintf("%02d:%02d", rep(0:23, each = 12), seq.int(0, 55, 5))
x <- data.frame(
  date = rep(date, each = length(time)),
  time = time
)
x$cfs <- stats::rnorm(nrow(x))

cols2aggregate <- "cfs"  # add more as necessary

S <- split(x[cols2aggregate], x$date)
means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
sds <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm = TRUE)))

On Sun, Aug 29, 2021 at 11:09 AM Rich Shepard wrote:
> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> from a USGS monitoring gauge recording values every 5 minutes. The data
> files contain 90K-93K lines and plotting all these data would produce a
> solid block of color.
>
> What I want are the daily means and standard deviation from these data.
>
> As an occasional R user (depending on project needs) I've no idea what
> packages could be applied to these data frames. There likely are multiple
> paths to extracting these daily values so summary statistics can be
> calculated and plotted. I'd appreciate suggestions on where to start to
> learn how I can do this.
>
> TIA,
>
> Rich
Re: [R] Calculate daily means from 5-minute interval data
On Sun, 29 Aug 2021, Rui Barradas wrote:
> I forgot in my previous answer, sorry for the duplicated mails.
>
> The function in my previous mail has a na.rm argument, defaulting to
> FALSE; pass na.rm = TRUE to remove the NA's.
>
> agg <- aggregate(cfs ~ date, df1, fun, na.rm = TRUE)
>
> Or simply change the default. I prefer to set na.rm = FALSE because
> that's what R's functions do, and this way I only ever have to be used
> to one default, with base R functions or my own code.
>
> Hope this helps,

Rui,

Again, yes it does.

Regards,
Rich
Re: [R] Calculate daily means from 5-minute interval data
You may find something useful on handling timestamp data here: https://jdnewmil.github.io/

On August 29, 2021 9:23:31 AM PDT, Jeff Newmiller wrote:
>The general idea is to create a "grouping" column with repeated values for
>each day, and then to use aggregate to compute your combined results. The
>dplyr package's group_by/summarise functions can also do this, and there
>are also proponents of the data.table package which is high performance
>but tends to depend on altering data in-place unlike most other R data
>handling functions.
>
>Also pay attention to missing data... if you have any then you will need
>to consider whether you want the strictness of na.rm=FALSE or
>permissiveness of na.rm=TRUE for your aggregation functions.
>
>On August 29, 2021 8:08:58 AM PDT, Rich Shepard wrote:
Re: [R] Calculate daily means from 5-minute interval data
On Sun, 29 Aug 2021, Jeff Newmiller wrote:
> The general idea is to create a "grouping" column with repeated values
> for each day, and then to use aggregate to compute your combined
> results. The dplyr package's group_by/summarise functions can also do
> this, and there are also proponents of the data.table package, which is
> high performance but tends to depend on altering data in-place, unlike
> most other R data handling functions.
>
> Also pay attention to missing data... if you have any then you will need
> to consider whether you want the strictness of na.rm=FALSE or the
> permissiveness of na.rm=TRUE for your aggregation functions.

Jeff,

Thank you. Yes, there are missing data, as sometimes the equipment fails or there's some other reason why some samples are missing.

Grouping on each day is just what I need. I'll re-learn dplyr and take a look at data.table.

Regards,
Rich
Re: [R] Calculate daily means from 5-minute interval data
On Sun, 29 Aug 2021, Eric Berger wrote:
> Provide dummy data (e.g. 5-10 lines), say like the contents of a csv
> file, and calculate by hand what you'd like to see in the plot. (And
> describe what the plot would look like.)

Eric,

Mea culpa! I extracted a set of sample data and forgot to include it in the message. Here it is:

date,time,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,12
...
2020-08-26,23:10,108000
2020-08-26,23:15,96200
2020-08-26,23:20,86700
2020-08-26,23:25,103000
2020-08-26,23:30,103000
2020-08-26,23:35,99500
2020-08-26,23:40,85200
2020-08-26,23:45,103000
2020-08-26,23:50,95800
2020-08-26,23:55,88200

Rich
Re: [R] Calculate daily means from 5-minute interval data
The general idea is to create a "grouping" column with repeated values for each day, and then to use aggregate to compute your combined results. The dplyr package's group_by/summarise functions can also do this, and there are also proponents of the data.table package, which is high performance but tends to depend on altering data in-place, unlike most other R data handling functions.

Also pay attention to missing data... if you have any then you will need to consider whether you want the strictness of na.rm=FALSE or the permissiveness of na.rm=TRUE for your aggregation functions.

On August 29, 2021 8:08:58 AM PDT, Rich Shepard wrote:
>I have a year's hydraulic data (discharge, stage height, velocity, etc.)
>from a USGS monitoring gauge recording values every 5 minutes. The data
>files contain 90K-93K lines and plotting all these data would produce a
>solid block of color.
>
>What I want are the daily means and standard deviation from these data.
>
>As an occasional R user (depending on project needs) I've no idea what
>packages could be applied to these data frames. There likely are multiple
>paths to extracting these daily values so summary statistics can be
>calculated and plotted. I'd appreciate suggestions on where to start to
>learn how I can do this.
>
>TIA,
>
>Rich
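The "grouping column" idea in base R, on an invented toy data frame (tapply is the lightest-weight relative of aggregate for this; the dplyr spelling Jeff mentions would be group_by(date) followed by summarise):

```r
# Toy data: two days, three samples each, values chosen so the daily
# mean and sd are obvious by inspection.
df <- data.frame(date = rep(c("2020-08-26", "2020-08-27"), each = 3),
                 cfs  = c(100, 110, 120, 200, 210, 220))

# date acts as the grouping column; one result per distinct day:
means <- tapply(df$cfs, df$date, mean)  # 110 210
sds   <- tapply(df$cfs, df$date, sd)    # 10 10
```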
Re: [R] Calculate daily means from 5-minute interval data
Hi Rich,

Your request is a bit open-ended but here's a suggestion that might help get you an answer. Provide dummy data (e.g. 5-10 lines), say like the contents of a csv file, and calculate by hand what you'd like to see in the plot. (And describe what the plot would look like.) It sounds like what you want could be done in a few lines of R code which would work both on the dummy data and the real data.

HTH,
Eric

On Sun, Aug 29, 2021 at 6:09 PM Rich Shepard wrote:
> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> from a USGS monitoring gauge recording values every 5 minutes. The data
> files contain 90K-93K lines and plotting all these data would produce a
> solid block of color.
>
> What I want are the daily means and standard deviation from these data.
>
> As an occasional R user (depending on project needs) I've no idea what
> packages could be applied to these data frames. There likely are multiple
> paths to extracting these daily values so summary statistics can be
> calculated and plotted. I'd appreciate suggestions on where to start to
> learn how I can do this.
>
> TIA,
>
> Rich
[R] Calculate daily means from 5-minute interval data
I have a year's hydraulic data (discharge, stage height, velocity, etc.) from a USGS monitoring gauge recording values every 5 minutes. The data files contain 90K-93K lines and plotting all these data would produce a solid block of color.

What I want are the daily means and standard deviation from these data.

As an occasional R user (depending on project needs) I've no idea what packages could be applied to these data frames. There likely are multiple paths to extracting these daily values so summary statistics can be calculated and plotted. I'd appreciate suggestions on where to start to learn how I can do this.

TIA,

Rich
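The core of what the thread converges on fits in a few lines. A sketch under the csv layout Rich later posted (date,time,cfs); the inline toy data frame stands in for reading the real file, and "gauge.csv" is a hypothetical name:

```r
# df <- read.csv("gauge.csv")   # hypothetical file; layout: date,time,cfs
df <- data.frame(date = as.Date(rep("2020-08-26", 5)),
                 time = sprintf("09:%02d", seq(30, 50, 5)),
                 cfs  = c(136000, 126000, 130000, 128000, 126000))

# One row per day, aggregated over all of that day's 5-minute samples:
daily_mean <- aggregate(cfs ~ date, df, mean)
daily_sd   <- aggregate(cfs ~ date, df, sd)

daily_mean$cfs  # 129200
```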