Re: [R] Calculate daily means from 5-minute interval data

2021-09-05 Thread Jeff Newmiller
This problem nearly always boils down to using meta-knowledge about the file. 
Having informal TZ info in the file is very helpful, but PST is not necessarily 
a uniquely-defined time zone specification, so you have to draw on information 
outside the file to know that these codes correspond to -0800 etc. (e.g. CST 
could be China Standard Time or US Central Standard Time). Thus, it is tough to 
make this into a broadly useful function.

You can also construct the timezone column from knowledge about the location of 
interest and the monotonicity of the time data. 
https://jdnewmil.github.io/eci298sp2016/QuickHowtos1.html#handling-time-data 
... but the answer to "easy" seems firmly in the eyes of the beholder.
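The meta-knowledge step can be sketched as a lookup table mapping the informal codes to numeric UTC offsets (a sketch assuming, as in this thread, that "PST"/"PDT" are the US Pacific codes; the sample timestamps are from Bill's data below):

```r
## Map informal tz codes to UTC offsets -- knowledge from outside the file.
offset <- c(PST = "-0800", PDT = "-0700")

stamps <- c("2020-11-01 01:15 PDT", "2020-11-01 01:15 PST")
tzcode <- sub(".* ", "", stamps)                 # last field is the code
when   <- as.POSIXct(paste(sub(" [A-Z]+$", "", stamps), offset[tzcode]),
                     format = "%Y-%m-%d %H:%M %z", tz = "UTC")
when   # the two 01:15's resolve to distinct instants, one hour apart
```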

On September 5, 2021 10:18:48 AM PDT, Bill Dunlap  
wrote:
>What is the best way to read (from a text file) timestamps from the fall
>time change, where there are two 1:15am's?  E.g., here is an extract from a
>US Geological Survey web site giving data on the river through our county
>on 2020-11-01, when we changed from PDT to PST,
>https://nwis.waterdata.usgs.gov/wa/nwis/uv/?cb_00010=on_00060=on_00065=on=rdb_no=12200500=_date=2020-11-01_date=2020-11-05
>.
>
>The timestamps include the date and time as well as PDT or PST.
>
>river <-
>c("datetime,tz,discharge,height,temp",
>  "2020-11-01 00:00,PDT,20500,16.44,9.3",
>  "2020-11-01 00:15,PDT,20500,16.44,9.3",
>  "2020-11-01 00:30,PDT,20500,16.43,9.3",
>  "2020-11-01 00:45,PDT,20400,16.40,9.3",
>  "2020-11-01 01:00,PDT,20400,16.40,9.3",
>  "2020-11-01 01:00,PST,20200,16.34,9.2",
>  "2020-11-01 01:15,PDT,20400,16.39,9.3",
>  "2020-11-01 01:15,PST,20200,16.34,9.2",
>  "2020-11-01 01:30,PDT,20300,16.37,9.2",
>  "2020-11-01 01:30,PST,20100,16.31,9.2",
>  "2020-11-01 01:45,PDT,20300,16.35,9.2",
>  "2020-11-01 01:45,PST,20100,16.29,9.2",
>  "2020-11-01 02:00,PST,20100,16.29,9.2",
>  "2020-11-01 02:15,PST,2,16.27,9.1",
>  "2020-11-01 02:30,PST,2,16.26,9.1"
>  )
>d <- read.table(text=river, sep=",",header=TRUE)
>
>The entries are obviously not in time order.
>
>Is there a simple way to read the datetime and tz columns together?  One
>way is to use d$tz to construct an offset that can be read with
>strptime's "%z".
>
>> d$POSIXct <-
>as.POSIXct(paste(d$datetime,ifelse(d$tz=="PDT","-0700","-0800")),
>format="%Y-%m-%d %H:%M %z")
>> d
>   datetime  tz discharge height temp POSIXct
>1  2020-11-01 00:00 PDT 20500  16.44  9.3 2020-11-01 00:00:00
>2  2020-11-01 00:15 PDT 20500  16.44  9.3 2020-11-01 00:15:00
>3  2020-11-01 00:30 PDT 20500  16.43  9.3 2020-11-01 00:30:00
>4  2020-11-01 00:45 PDT 20400  16.40  9.3 2020-11-01 00:45:00
>5  2020-11-01 01:00 PDT 20400  16.40  9.3 2020-11-01 01:00:00
>6  2020-11-01 01:00 PST 20200  16.34  9.2 2020-11-01 01:00:00
>7  2020-11-01 01:15 PDT 20400  16.39  9.3 2020-11-01 01:15:00
>8  2020-11-01 01:15 PST 20200  16.34  9.2 2020-11-01 01:15:00
>9  2020-11-01 01:30 PDT 20300  16.37  9.2 2020-11-01 01:30:00
>10 2020-11-01 01:30 PST 20100  16.31  9.2 2020-11-01 01:30:00
>11 2020-11-01 01:45 PDT 20300  16.35  9.2 2020-11-01 01:45:00
>12 2020-11-01 01:45 PST 20100  16.29  9.2 2020-11-01 01:45:00
>13 2020-11-01 02:00 PST 20100  16.29  9.2 2020-11-01 02:00:00
>14 2020-11-01 02:15 PST 2  16.27  9.1 2020-11-01 02:15:00
>15 2020-11-01 02:30 PST 2  16.26  9.1 2020-11-01 02:30:00
>> with(d[order(d$POSIXct),], plot(temp)) # monotonic temperature
>
>-Bill
>
>
>On Thu, Sep 2, 2021 at 12:41 PM Jeff Newmiller 
>wrote:
>
>> Regardless of whether you use the lower-level split function, or the
>> higher-level aggregate function, or the tidyverse group_by function, the
>> key is learning how to create the column that is the same for all records
>> corresponding to the time interval of interest.
>>
>> If you convert the sampdate to POSIXct, the tz IS important, because most
>> of us use local timezones that respect daylight savings time, and a naive
>> conversion of standard time will run into trouble if R is assuming daylight
>> savings time applies. The lubridate package gets around this by always
>> assuming UTC and giving you a function to "fix" the timezone after the
>> conversion. I prefer to always be specific about timezones, at least by
>> using something like
>>
>> Sys.setenv( TZ = "Etc/GMT+8" )
>>
>> which does not respect daylight savings.
>>
>> Regarding using character data for identifying the month, in order to have
>> clean plots of the data I prefer to use the trunc function but it returns a
>> POSIXlt so I convert it to POSIXct:
>>
>> discharge$sampmonthbegin <- as.POSIXct( trunc( discharge$sampdate,
>> units = "months" ) )
>>
>> Then any of various ways can be used to aggregate the records by that
>> column.
>>
>> On September 2, 2021 12:10:15 PM PDT, Andrew Simmons 
>> wrote:
>> >You could use 'split' to create a list of data frames, and then apply a
>> >function to each to get the means and sds.
>> >
>> >
>> >cols <- "cfs"  

Re: [R] Calculate daily means from 5-minute interval data

2021-09-05 Thread Bill Dunlap
What is the best way to read (from a text file) timestamps from the fall
time change, where there are two 1:15am's?  E.g., here is an extract from a
US Geological Survey web site giving data on the river through our county
on 2020-11-01, when we changed from PDT to PST,
https://nwis.waterdata.usgs.gov/wa/nwis/uv/?cb_00010=on_00060=on_00065=on=rdb_no=12200500=_date=2020-11-01_date=2020-11-05
.

The timestamps include the date and time as well as PDT or PST.

river <-
c("datetime,tz,discharge,height,temp",
  "2020-11-01 00:00,PDT,20500,16.44,9.3",
  "2020-11-01 00:15,PDT,20500,16.44,9.3",
  "2020-11-01 00:30,PDT,20500,16.43,9.3",
  "2020-11-01 00:45,PDT,20400,16.40,9.3",
  "2020-11-01 01:00,PDT,20400,16.40,9.3",
  "2020-11-01 01:00,PST,20200,16.34,9.2",
  "2020-11-01 01:15,PDT,20400,16.39,9.3",
  "2020-11-01 01:15,PST,20200,16.34,9.2",
  "2020-11-01 01:30,PDT,20300,16.37,9.2",
  "2020-11-01 01:30,PST,20100,16.31,9.2",
  "2020-11-01 01:45,PDT,20300,16.35,9.2",
  "2020-11-01 01:45,PST,20100,16.29,9.2",
  "2020-11-01 02:00,PST,20100,16.29,9.2",
  "2020-11-01 02:15,PST,2,16.27,9.1",
  "2020-11-01 02:30,PST,2,16.26,9.1"
  )
d <- read.table(text=river, sep=",",header=TRUE)

The entries are obviously not in time order.

Is there a simple way to read the datetime and tz columns together?  One
way is to use d$tz to construct an offset that can be read with
strptime's "%z".

> d$POSIXct <-
as.POSIXct(paste(d$datetime,ifelse(d$tz=="PDT","-0700","-0800")),
format="%Y-%m-%d %H:%M %z")
> d
   datetime  tz discharge height temp POSIXct
1  2020-11-01 00:00 PDT 20500  16.44  9.3 2020-11-01 00:00:00
2  2020-11-01 00:15 PDT 20500  16.44  9.3 2020-11-01 00:15:00
3  2020-11-01 00:30 PDT 20500  16.43  9.3 2020-11-01 00:30:00
4  2020-11-01 00:45 PDT 20400  16.40  9.3 2020-11-01 00:45:00
5  2020-11-01 01:00 PDT 20400  16.40  9.3 2020-11-01 01:00:00
6  2020-11-01 01:00 PST 20200  16.34  9.2 2020-11-01 01:00:00
7  2020-11-01 01:15 PDT 20400  16.39  9.3 2020-11-01 01:15:00
8  2020-11-01 01:15 PST 20200  16.34  9.2 2020-11-01 01:15:00
9  2020-11-01 01:30 PDT 20300  16.37  9.2 2020-11-01 01:30:00
10 2020-11-01 01:30 PST 20100  16.31  9.2 2020-11-01 01:30:00
11 2020-11-01 01:45 PDT 20300  16.35  9.2 2020-11-01 01:45:00
12 2020-11-01 01:45 PST 20100  16.29  9.2 2020-11-01 01:45:00
13 2020-11-01 02:00 PST 20100  16.29  9.2 2020-11-01 02:00:00
14 2020-11-01 02:15 PST 2  16.27  9.1 2020-11-01 02:15:00
15 2020-11-01 02:30 PST 2  16.26  9.1 2020-11-01 02:30:00
> with(d[order(d$POSIXct),], plot(temp)) # monotonic temperature

-Bill


On Thu, Sep 2, 2021 at 12:41 PM Jeff Newmiller 
wrote:

> Regardless of whether you use the lower-level split function, or the
> higher-level aggregate function, or the tidyverse group_by function, the
> key is learning how to create the column that is the same for all records
> corresponding to the time interval of interest.
>
> If you convert the sampdate to POSIXct, the tz IS important, because most
> of us use local timezones that respect daylight savings time, and a naive
> conversion of standard time will run into trouble if R is assuming daylight
> savings time applies. The lubridate package gets around this by always
> assuming UTC and giving you a function to "fix" the timezone after the
> conversion. I prefer to always be specific about timezones, at least by
> using something like
>
> Sys.setenv( TZ = "Etc/GMT+8" )
>
> which does not respect daylight savings.
>
> Regarding using character data for identifying the month, in order to have
> clean plots of the data I prefer to use the trunc function but it returns a
> POSIXlt so I convert it to POSIXct:
>
> discharge$sampmonthbegin <- as.POSIXct( trunc( discharge$sampdate,
> units = "months" ) )
>
> Then any of various ways can be used to aggregate the records by that
> column.
>
> On September 2, 2021 12:10:15 PM PDT, Andrew Simmons 
> wrote:
> >You could use 'split' to create a list of data frames, and then apply a
> >function to each to get the means and sds.
> >
> >
> >cols <- "cfs"  # add more as necessary
> >S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
> >means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
> >sds   <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm =
> >TRUE)))
> >
> >On Thu, Sep 2, 2021 at 3:01 PM Rich Shepard 
> >wrote:
> >
> >> On Thu, 2 Sep 2021, Rich Shepard wrote:
> >>
> >> > If I correctly understand the output of as.POSIXlt each date and time
> >> > element is separate, so input such as 2016-03-03 12:00 would now be
> 2016
> >> 03
> >> > 03 12 00 (I've not read how the elements are separated). (The TZ is
> not
> >> > important because all data are either PST or PDT.)
> >>
> >> Using this script:
> >> discharge <- read.csv('../data/water/discharge.dat', header = TRUE, sep
> =
> >> ',', stringsAsFactors = FALSE)
> >> discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz 

Re: [R] Calculate daily means from 5-minute interval data

2021-09-04 Thread Rich Shepard

On Fri, 3 Sep 2021, Jeff Newmiller wrote:


The fact that your projects are in a single time zone is irrelevant. I am
not sure how you can be so confident in saying it does not matter whether
the data were recorded in PDT or PST, since if it were recorded in PDT
then there would be a day in March with 23 hours and another day in
November with 25 hours, but if it were recorded in PST then there would
always be 24 hours in every day, and R almost always assumes daylight
savings if you don't tell it otherwise!


Got it, Jeff. Thanks very much.

Regards,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Calculate daily means from 5-minute interval data

2021-09-04 Thread Jeff Newmiller

On Fri, 3 Sep 2021, Rich Shepard wrote:


On Thu, 2 Sep 2021, Jeff Newmiller wrote:


Regardless of whether you use the lower-level split function, or the
higher-level aggregate function, or the tidyverse group_by function, the
key is learning how to create the column that is the same for all records
corresponding to the time interval of interest.


Jeff,

I definitely agree with the above


If you convert the sampdate to POSIXct, the tz IS important, because most
of us use local timezones that respect daylight savings time, and a naive
conversion of standard time will run into trouble if R is assuming
daylight savings time applies. The lubridate package gets around this by
always assuming UTC and giving you a function to "fix" the timezone after
the conversion. I prefer to always be specific about timezones, at least
by using something like
    Sys.setenv( TZ = "Etc/GMT+8" )
which does not respect daylight savings.


I'm not following you here. All my projects have always been in a single
time zone and the data might be recorded at June 19th or November 4th but do
not depend on whether the time is PDT or PST. My hosts all set the hardware
clock to local time, not UTC.


The fact that your projects are in a single time zone is irrelevant. I am 
not sure how you can be so confident in saying it does not matter whether 
the data were recorded in PDT or PST, since if it were recorded in PDT 
then there would be a day in March with 23 hours and another day in 
November with 25 hours, but if it were recorded in PST then there would 
always be 24 hours in every day, and R almost always assumes daylight 
savings if you don't tell it otherwise!


I am also normally working with automated collection devices that record 
data in standard time year round. But if you fail to tell R that this is 
the case, then it will almost always assume your data are stored with 
daylight savings time and screw up the conversion to computable time 
format. This screw up may include NA values in spring time when standard 
time has perfectly valid times between 1am and 2am on the changeover day, 
but in daylight time those timestamps would be invalid and will end up as 
NA values in your timestamp column.
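That spring-changeover failure mode can be sketched in a couple of lines (an illustration, not from the thread; the NA result for nonexistent local times is the usual behavior but is platform-dependent):

```r
## 2021-03-14 02:30 never occurs on the US Pacific wall clock (clocks
## jump from 02:00 to 03:00), but it is a perfectly valid reading from
## a logger that records standard time year round.
x <- "2021-03-14 02:30"
as.POSIXct(x, format = "%Y-%m-%d %H:%M", tz = "America/Los_Angeles")  # NA on most platforms
as.POSIXct(x, format = "%Y-%m-%d %H:%M", tz = "Etc/GMT+8")            # parses cleanly
```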



As the location(s) at which data are collected remain fixed geographically I
don't understand why daylight savings time, or non-daylight savings time is
important.


I am telling you that it is important _TO R_ if you use POSIXt times. 
Acknowledge this and move on with life, or avoid POSIXt data. As I said, 
one way to acknowledge this while limiting the amount of attention you 
have to give to the problem is to use UTC/GMT everywhere... but this can 
lead to weird time of day problems as I pointed out in my timestamp 
cleaning slides: 
https://jdnewmil.github.io/time-2018-10/TimestampCleaning.html


If you want to use GMT everywhere... then you have to use GMT explicitly 
because the default timezone in R is practically never GMT for most 
people. You. Need. To. Be. Explicit. Don't fight it. Just do it. It isn't 
hard.



Regarding using character data for identifying the month, in order to have
clean plots of the data I prefer to use the trunc function but it returns
a POSIXlt so I convert it to POSIXct:


I don't use character data for months, as far as I know. If a sample data
is, for example, 2021-09-03 then monthly summaries are based on '09', not
'September.'


You are taking this out of context and complaining that it has no context. 
This was a reply to a response by Andrew Simmons in which he used the 
"format" function to create unique year/month strings to act as group-by 
data. Earlier, when I originally responded to clarify how you could use 
the dplyr group_by function, I used your character date column without 
combining it with time or converting it to Date at all. If you studied these 
responses more carefully you would indeed have been using character data 
for grouping in some cases, and my only point was that doing so can indeed 
be a shortcut to the immediate answer while being troublesome later in the 
analysis. Accusing you of mishandling data was not my intention.



I've always valued your inputs to help me understand what I don't. In this
case I'm really lost in understanding your position.


I hope my comments are clear enough now.


Have a good Labor Day weekend,


Thanks! (Not relevant to many on this list.)

---
Jeff Newmiller                The .   .  Go Live...
DCN:                Basics: ##.#.   ##.#.  Live Go...
                Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries    O.O#.   #.O#.  with
/Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k


Re: [R] Calculate daily means from 5-minute interval data

2021-09-03 Thread Rich Shepard

On Thu, 2 Sep 2021, Jeff Newmiller wrote:


Regardless of whether you use the lower-level split function, or the
higher-level aggregate function, or the tidyverse group_by function, the
key is learning how to create the column that is the same for all records
corresponding to the time interval of interest.


Jeff,

I definitely agree with the above


If you convert the sampdate to POSIXct, the tz IS important, because most
of us use local timezones that respect daylight savings time, and a naive
conversion of standard time will run into trouble if R is assuming
daylight savings time applies. The lubridate package gets around this by
always assuming UTC and giving you a function to "fix" the timezone after
the conversion. I prefer to always be specific about timezones, at least
by using something like
    Sys.setenv( TZ = "Etc/GMT+8" )
which does not respect daylight savings.


I'm not following you here. All my projects have always been in a single
time zone and the data might be recorded at June 19th or November 4th but do
not depend on whether the time is PDT or PST. My hosts all set the hardware
clock to local time, not UTC.

As the location(s) at which data are collected remain fixed geographically I
don't understand why daylight savings time, or non-daylight savings time is
important.


Regarding using character data for identifying the month, in order to have
clean plots of the data I prefer to use the trunc function but it returns
a POSIXlt so I convert it to POSIXct:


I don't use character data for months, as far as I know. If a sample data
is, for example, 2021-09-03 then monthly summaries are based on '09', not
'September.'

I've always valued your inputs to help me understand what I don't. In this
case I'm really lost in understanding your position.

Have a good Labor Day weekend,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-09-03 Thread Rich Shepard

On Thu, 2 Sep 2021, Jeff Newmiller wrote:


Regardless of whether you use the lower-level split function, or the
higher-level aggregate function, or the tidyverse group_by function, the
key is learning how to create the column that is the same for all records
corresponding to the time interval of interest.


Jeff,

I tried responding to only you but my message bounced:

: host
d9300a.ess.barracudanetworks.com[209.222.82.252] said: 550 permanent
failure for one or more recipients (jdnew...@dcn.davis.ca.us:blocked) (in
reply to end of DATA command)

My response was not pertinent to the entire list, IMO, so I sent it to your
address.

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-09-02 Thread Jeff Newmiller
Regardless of whether you use the lower-level split function, or the 
higher-level aggregate function, or the tidyverse group_by function, the key is 
learning how to create the column that is the same for all records 
corresponding to the time interval of interest.

If you convert the sampdate to POSIXct, the tz IS important, because most of us 
use local timezones that respect daylight savings time, and a naive conversion 
of standard time will run into trouble if R is assuming daylight savings time 
applies. The lubridate package gets around this by always assuming UTC and 
giving you a function to "fix" the timezone after the conversion. I prefer to 
always be specific about timezones, at least by using something like

Sys.setenv( TZ = "Etc/GMT+8" )

which does not respect daylight savings.

Regarding using character data for identifying the month, in order to have 
clean plots of the data I prefer to use the trunc function but it returns a 
POSIXlt so I convert it to POSIXct:

discharge$sampmonthbegin <- as.POSIXct( trunc( discharge$sampdate, units = 
"months" ) )

Then any of various ways can be used to aggregate the records by that column.
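For instance, one such way is base aggregate() on that truncated-month column (a sketch; 'discharge', 'sampdate', and 'cfs' are the names used in this thread, with made-up values):

```r
## Hypothetical stand-in for the poster's discharge data.
discharge <- data.frame(
  sampdate = as.POSIXct(c("2016-03-03 12:00", "2016-03-15 12:00",
                          "2016-04-01 00:05"), tz = "Etc/GMT+8"),
  cfs = c(149000, 151000, 156000)
)

## First moment of each month, kept as POSIXct so plots stay clean.
discharge$sampmonthbegin <- as.POSIXct(trunc(discharge$sampdate, units = "months"))

## Monthly means, one row per month present in the data.
aggregate(cfs ~ sampmonthbegin, data = discharge, FUN = mean)
```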

On September 2, 2021 12:10:15 PM PDT, Andrew Simmons  wrote:
>You could use 'split' to create a list of data frames, and then apply a
>function to each to get the means and sds.
>
>
>cols <- "cfs"  # add more as necessary
>S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
>means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
>sds   <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm =
>TRUE)))
>
>On Thu, Sep 2, 2021 at 3:01 PM Rich Shepard 
>wrote:
>
>> On Thu, 2 Sep 2021, Rich Shepard wrote:
>>
>> > If I correctly understand the output of as.POSIXlt each date and time
>> > element is separate, so input such as 2016-03-03 12:00 would now be 2016
>> 03
>> > 03 12 00 (I've not read how the elements are separated). (The TZ is not
>> > important because all data are either PST or PDT.)
>>
>> Using this script:
>> discharge <- read.csv('../data/water/discharge.dat', header = TRUE, sep =
>> ',', stringsAsFactors = FALSE)
>> discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz = "",
>>   format = '%Y-%m-%d %H:%M',
>>   optional = 'logical')
>> discharge$cfs <- as.numeric(discharge$cfs, length = 6)
>>
>> I get this result:
>> > head(discharge)
>>   sampdatecfs
>> 1 2016-03-03 12:00:00 149000
>> 2 2016-03-03 12:10:00 15
>> 3 2016-03-03 12:20:00 151000
>> 4 2016-03-03 12:30:00 156000
>> 5 2016-03-03 12:40:00 154000
>> 6 2016-03-03 12:50:00 15
>>
>> I'm completely open to suggestions on using this output to calculate
>> monthly
>> means and sds.
>>
>> If dplyr::summarize() will do so please show me how to modify this command:
>> disc_monthly <- ( discharge
>>  %>% group_by(sampdate)
>>  %>% summarize(exp_value = mean(cfs, na.rm = TRUE))
>> because it produces daily means, not monthly means.
>>
>> TIA,
>>
>> Rich
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>   [[alternative HTML version deleted]]
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

-- 
Sent from my phone. Please excuse my brevity.



Re: [R] Calculate daily means from 5-minute interval data

2021-09-02 Thread Rich Shepard

On Thu, 2 Sep 2021, Andrew Simmons wrote:


You could use 'split' to create a list of data frames, and then apply a
function to each to get the means and sds.

cols <- "cfs"  # add more as necessary
S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
sds   <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm =
TRUE)))


Andrew,

Thank you for the valuable lesson. This is new to me and I know I'll have
use for it in the future, too.

Much appreciated!

Stay well,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-09-02 Thread Andrew Simmons
You could use 'split' to create a list of data frames, and then apply a
function to each to get the means and sds.


cols <- "cfs"  # add more as necessary
S <- split(discharge[cols], format(discharge$sampdate, format = "%Y-%m"))
means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
sds   <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm =
TRUE)))

On Thu, Sep 2, 2021 at 3:01 PM Rich Shepard 
wrote:

> On Thu, 2 Sep 2021, Rich Shepard wrote:
>
> > If I correctly understand the output of as.POSIXlt each date and time
> > element is separate, so input such as 2016-03-03 12:00 would now be 2016
> 03
> > 03 12 00 (I've not read how the elements are separated). (The TZ is not
> > important because all data are either PST or PDT.)
>
> Using this script:
> discharge <- read.csv('../data/water/discharge.dat', header = TRUE, sep =
> ',', stringsAsFactors = FALSE)
> discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz = "",
>   format = '%Y-%m-%d %H:%M',
>   optional = 'logical')
> discharge$cfs <- as.numeric(discharge$cfs, length = 6)
>
> I get this result:
> > head(discharge)
>   sampdatecfs
> 1 2016-03-03 12:00:00 149000
> 2 2016-03-03 12:10:00 15
> 3 2016-03-03 12:20:00 151000
> 4 2016-03-03 12:30:00 156000
> 5 2016-03-03 12:40:00 154000
> 6 2016-03-03 12:50:00 15
>
> I'm completely open to suggestions on using this output to calculate
> monthly
> means and sds.
>
> If dplyr::summarize() will do so please show me how to modify this command:
> disc_monthly <- ( discharge
>  %>% group_by(sampdate)
>  %>% summarize(exp_value = mean(cfs, na.rm = TRUE))
> because it produces daily means, not monthly means.
>
> TIA,
>
> Rich
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>




Re: [R] Calculate daily means from 5-minute interval data

2021-09-02 Thread Rich Shepard

On Thu, 2 Sep 2021, Rich Shepard wrote:


If I correctly understand the output of as.POSIXlt each date and time
element is separate, so input such as 2016-03-03 12:00 would now be 2016 03
03 12 00 (I've not read how the elements are separated). (The TZ is not
important because all data are either PST or PDT.)


Using this script:
discharge <- read.csv('../data/water/discharge.dat', header = TRUE, sep = ',', 
stringsAsFactors = FALSE)
discharge$sampdate <- as.POSIXlt(discharge$sampdate, tz = "",
 format = '%Y-%m-%d %H:%M',
 optional = 'logical')
discharge$cfs <- as.numeric(discharge$cfs, length = 6)

I get this result:

head(discharge)

 sampdatecfs
1 2016-03-03 12:00:00 149000
2 2016-03-03 12:10:00 15
3 2016-03-03 12:20:00 151000
4 2016-03-03 12:30:00 156000
5 2016-03-03 12:40:00 154000
6 2016-03-03 12:50:00 15

I'm completely open to suggestions on using this output to calculate monthly
means and sds.

If dplyr::summarize() will do so please show me how to modify this command:
disc_monthly <- ( discharge
%>% group_by(sampdate)
%>% summarize(exp_value = mean(cfs, na.rm = TRUE))
because it produces daily means, not monthly means.
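(One possible modification, sketched with hypothetical data: group on a year-month key built with format() instead of on the full timestamp.)

```r
library(dplyr)

## Hypothetical stand-in for the poster's discharge data.
discharge <- data.frame(
  sampdate = as.POSIXct(c("2016-03-03 12:00", "2016-03-03 12:10",
                          "2016-04-01 00:05"), tz = "Etc/GMT+8"),
  cfs = c(149000, 151000, 156000)
)

disc_monthly <- ( discharge
  %>% group_by(sampmonth = format(sampdate, "%Y-%m"))   # "2016-03", "2016-04", ...
  %>% summarize(exp_value = mean(cfs, na.rm = TRUE))
)
disc_monthly   # one row per month, not per day
```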

TIA,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-09-02 Thread Rich Shepard

On Mon, 30 Aug 2021, Richard O'Keefe wrote:


x <- rnorm(samples.per.day * 365)
length(x)

[1] 105120

Reshape the fake data into a matrix where each row represents one
24-hour period.


m <- matrix(x, ncol=samples.per.day, byrow=TRUE)


Richard,

Now I understand the need to keep the date and time as a single datetime
column; separately, dplyr's summarize() provides daily means (too many data
points to plot over 3-5 years). I reformatted the data to provide a
sampledatetime column and a values column.

If I correctly understand the output of as.POSIXlt each date and time
element is separate, so input such as 2016-03-03 12:00 would now be 2016 03
03 12 00 (I've not read how the elements are separated). (The TZ is not
important because all data are either PST or PDT.)


Now we can summarise the rows any way we want.
The basic tool here is ?apply.
?rowMeans is said to be faster than using apply to calculate means,
so we'll use that.  There is no *rowSds so we have to use apply
for the standard deviation.  I use ?head because I don't want to
post tens of thousands of meaningless numbers.


If I create a matrix using the above syntax the resulting rows contain all
recorded values for a specific day. What would be the syntax to collect all
values for each month?

This would result in 12 rows per year; the periods of record for the five
variables available from that gauge station vary in length.
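Since months have unequal numbers of 5-minute samples, a fixed-width matrix reshape no longer fits; one way (a sketch with simulated data, mirroring the rnorm example quoted above) is to split on a year-month key:

```r
set.seed(1)
samples.per.day <- 24 * 12                # one sample every 5 minutes

## Simulated year of 5-minute data with real timestamps.
sampdatetime <- seq(as.POSIXct("2016-01-01", tz = "Etc/GMT+8"),
                    by = "5 min", length.out = samples.per.day * 365)
x <- rnorm(samples.per.day * 365)

## Twelve ragged groups per year instead of equal-length matrix rows.
by.month      <- split(x, format(sampdatetime, "%Y-%m"))
monthly.means <- sapply(by.month, mean)
monthly.sds   <- sapply(by.month, sd)
```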

Regards,

Rich



Re: [R] Calculate daily means from 5-minute interval data [RESOLVED]

2021-09-01 Thread Rich Shepard

On Tue, 31 Aug 2021, Jeff Newmiller wrote:


Never use stringsAsFactors on uncleaned data. For one thing you give a
factor to as.Date and it tries to make sense of the integer
representation, not the character representation.


Jeff,

Oops! I had changed it in a previous version of the script and forgot to
change it back again. Fixed.


dtad <- ( dta
    %>% group_by( sampdate )
    %>% summarise( exp_value = mean(cfs, na.rm = TRUE)
                 , Count = n()
                 )
    )


Thank you. Now I understand how to use dplyr's summarize().

Best regards,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-09-01 Thread Rich Shepard

On Wed, 1 Sep 2021, Richard O'Keefe wrote:


You have missed the point. The issue is not the temporal distance, but the
fact that the data you have are NOT the raw instrumental data and are NOT
subject to the limitations of the recording instruments. The data you get
from the USGS is not the raw instrumental value, and there is no longer
any good reason for there to be any gaps in it. Indeed, the Rogue River
data I looked at explicitly includes some flows labelled "Ae" meaning that
they are NOT the instrumental data at all, but estimated.


Richard,

Thanks for your comments.

Regards,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-31 Thread Richard O'Keefe
I wrote:
> > By the time you get the data from the USGS, you are already far past the 
> > point
> > where what the instruments can write is important.
Rich Shepard replied:
> The data are important because they show what's happened in that period of
> record. Don't physicians take a medical history from patients even though
> those data are far past the point they occurred?

You have missed the point.  The issue is not the temporal distance, but the
fact that the data you have are NOT the raw instrumental data and are NOT
subject to the limitations of the recording instruments.  The data you get from
the USGS is not the raw instrumental value, and there is no longer any good
reason for there to be any gaps in it.  Indeed, the Rogue River data I looked
at explicitly includes some flows labelled "Ae" meaning that they are NOT the
instrumental data at all, but estimated.

> And I use emacs to replace the space between columns with commas so the date
> and the time are separate.

There does not seem to be any good reason for this.
As I demonstrated, it is easy to convert these timestamps to
POSIXct form, which is good for calculating with.
If you want to extract year, month, day, etc., by far the easiest
way is to convert to POSIXlt form (so keeping the timestamp as a
single field) and then use $ to extract the fields.
> n <- as.POSIXlt("2003.04.05 06:07", format="%Y.%m.%d %H:%M", tz="UTC")
> n
[1] "2003-04-05 06:07:00 UTC"
> c(n$year+1900, n$mon+1, n$mday, n$hour, n$min)
[1] 2003    4    5    6    7

> > The flow is dominated by a series of "bursts" with a fast onset to a peak
> > and a slow decay, coming in a range of sizes from quite small to rather
> > large, separated by gaps of 4 to 45 days.
>
> And when discharge is controlled by flows through a hydroelectric dam there
> is a lot of variability. The pattern is important to fish as well as
> regulators.

And what is important to fish is NOT captured by daily means and standard
deviations.  For what it's worth, my understanding is that most of the dams on
the Rogue River have been removed, leaving only the Lost Creek Lake one,
and that this has been good for the fish.

Suppose you have a day when there are 16 hours with no water at all flowing,
then 8 hours with 12 cumecs because a dam upstream is discharging.  Then
the daily mean is 4 cumecs, which might look good for fish, but it wasn't.
"Number of minutes below minimum safe level" might be more interesting
for the fish.
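That threshold idea can be sketched in base R. This is a hypothetical sketch, not code from the thread: the safe level of 2 cumecs, the 5-minute sampling, and the made-up flows are all assumptions chosen to mirror the 16-hour/8-hour example above.

```r
# Hypothetical sketch: minutes per day below an assumed minimum safe flow.
# Assumes 5-minute sampling, so each observation stands for 5 minutes.
min_safe <- 2  # assumed minimum safe level, cumecs
r <- data.frame(
  when = as.POSIXct("2021-08-01", tz = "UTC") +
         seq(0, by = 300, length.out = 288),   # one day in 5-min steps
  flow = c(rep(0, 192), rep(12, 96))           # 16 h dry, 8 h at 12 cumecs
)
below <- aggregate(list(min_below = 5 * (r$flow < min_safe)),
                   by = list(day = as.Date(r$when)), FUN = sum)
below$min_below   # 960 minutes (16 hours) below the safe level
mean(r$flow)      # yet the daily mean is 4 cumecs
```

The daily mean of 4 cumecs hides the 960 dry minutes; the count-below-threshold summary does not.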

From the data we have alone, we cannot tell which bursts are due to
releases from dams and which have other causes.  Dam releases are under
human control, storms are not.

Looking at the Rogue River data, plotting daily means
- lowers the peaks
- moves them right
- changes the overall shape
Not severely, mind you, but enough to avoid if you don't have to.

By the way, by far the easiest way to do day-wise summaries,
if you really feel you must, is to start with a POSIXct or POSIXlt
column, let's call it r$when, then
  d <- trunc(difftime(r$when, min(r$when), units="days")) + 1
  m <- aggregate(r$flow, by=list(d), FUN=mean)
  plot(m, type="l")
You can plug in other summary functions, not just mean.

Remember:
  for all calculations involving dates and times,
  prefer using the built in date and time classes to
  hacking around the problem

  aggregate() is a good way to compute oddball summaries.


> > - how do I *detect* these bursts? (detecting a peak isn't too hard,
> >   but the peak is not the onset)
> > - how do I *characterise* these bursts?
> >   (and is the onset rate related to the peak size?)
> > - what's left after taking the bursts out?
> > - can I relate these bursts to something going on upstream?
>
> Well, those questions could be appropriate depending on what questions you
> need the data to answer.
>
> Environmental data are quite different from experimental, economic,
> financial, and public data (e.g., unemployment, housing costs).
>
> There are always multiple ways to address an analytical need. Thank you for
> your contributions.
>
> Stay well,
>
> Rich
>



Re: [R] Calculate daily means from 5-minute interval data

2021-08-31 Thread Jeff Newmiller
Never use stringsAsFactors on uncleaned data. For one thing, if you give a 
factor to as.Date, it tries to make sense of the integer representation, not 
the character representation.

library(dplyr)
dta <- read.csv( text =
"sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,12
", stringsAsFactors = FALSE)
dtad <- (   dta
%>% group_by( sampdate )
%>% summarise( exp_value = mean(cfs, na.rm = TRUE)
 , Count = n()
 )
)

On August 31, 2021 2:11:05 PM PDT, Rich Shepard  
wrote:
>On Sun, 29 Aug 2021, Jeff Newmiller wrote:
>
>> The general idea is to create a "grouping" column with repeated values for
>> each day, and then to use aggregate to compute your combined results. The
>> dplyr package's group_by/summarise functions can also do this, and there
>> are also proponents of the data.table package which is high performance
>> but tends to depend on altering data in-place unlike most other R data
>> handling functions.
>
>Jeff,
>
>I've read a number of docs discussing dplyr's summarize and group_by
>functions (including that section of Hadley's 'R for Data Science' book),
>yet I'm missing something; I think that I need to separate the single
>sampdate column into columns for year, month, and day, and group_by
>year/month, summarizing within those groups.
>
>The data are of this format:
>sampdate,samptime,cfs
>2020-08-26,09:30,136000
>2020-08-26,09:35,126000
>2020-08-26,09:40,13
>2020-08-26,09:45,128000
>2020-08-26,09:50,126000
>2020-08-26,09:55,125000
>2020-08-26,10:00,121000
>2020-08-26,10:05,117000
>2020-08-26,10:10,12
>
>My current script is:
>
>---8<--
>library('tidyverse')
>
>discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',', 
>stringsAsFactors = TRUE)
>discharge$sampdate <- as.Date(discharge$sampdate)
>discharge$cfs <- as.numeric(discharge$cfs, length = 6)
>
># use dplyr.summarize grouped by date
>
># need to separate sampdate into %Y-%M-%D in order to group_by the month?
>by_month <- discharge %>%
>   group_by(sampdate ...
>summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs))
>>8
>
>and the results are:
>
>> str(discharge)
>'data.frame':  93254 obs. of  3 variables:
>  $ sampdate: Date, format: "2020-08-26" "2020-08-26" ...
>  $ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123 128 
> 133 138 143 148 ...
>  $ cfs : num  176 156 165 161 156 154 144 137 142 142 ...
>> ls()
>[1] "by_month"  "discharge"
>> by_month
># A tibble: 93,254 × 3
># Groups:   sampdate [322]
>sampdate   samptime   cfs
> 
>  1 2020-08-26 09:30  176
>  2 2020-08-26 09:35  156
>  3 2020-08-26 09:40  165
>  4 2020-08-26 09:45  161
>  5 2020-08-26 09:50  156
>  6 2020-08-26 09:55  154
>  7 2020-08-26 10:00  144
>  8 2020-08-26 10:05  137
>  9 2020-08-26 10:10  142
>10 2020-08-26 10:15  142
># … with 93,244 more rows
>
>I don't know why the discharge values are truncated to 3 digits when they're
>6 digits in the input data.
>
>Suggested readings appreciated,
>
>Rich
>

-- 
Sent from my phone. Please excuse my brevity.



Re: [R] Calculate daily means from 5-minute interval data

2021-08-31 Thread Rich Shepard

On Sun, 29 Aug 2021, Jeff Newmiller wrote:


The general idea is to create a "grouping" column with repeated values for
each day, and then to use aggregate to compute your combined results. The
dplyr package's group_by/summarise functions can also do this, and there
are also proponents of the data.table package which is high performance
but tends to depend on altering data in-place unlike most other R data
handling functions.


Jeff,

I've read a number of docs discussing dplyr's summarize and group_by
functions (including that section of Hadley's 'R for Data Science' book),
yet I'm missing something; I think that I need to separate the single
sampdate column into columns for year, month, and day, and group_by
year/month, summarizing within those groups.
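One way to get a month grouping without splitting the date into separate columns is to build a year-month key with format(). This is a base-R sketch with made-up values, not the poster's actual discharge data:

```r
# Sketch: group records by calendar month via a "%Y-%m" key
# instead of separate year and month columns.
dta <- data.frame(
  sampdate = as.Date(c("2020-08-26", "2020-08-26", "2020-09-01")),
  cfs      = c(136000, 126000, 121000)
)
dta$yearmon <- format(dta$sampdate, "%Y-%m")
monthly <- aggregate(list(mean_cfs = dta$cfs),
                     by = list(yearmon = dta$yearmon), FUN = mean)
monthly   # one mean per calendar month
```

The same yearmon column works as a group_by() key in dplyr if you prefer that route.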

The data are of this format:
sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,12

My current script is:

---8<--
library('tidyverse')

discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',', 
stringsAsFactors = TRUE)
discharge$sampdate <- as.Date(discharge$sampdate)
discharge$cfs <- as.numeric(discharge$cfs, length = 6)

# use dplyr.summarize grouped by date

# need to separate sampdate into %Y-%M-%D in order to group_by the month?
by_month <- discharge %>%
  group_by(sampdate ...
summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs))
>8

and the results are:


str(discharge)

'data.frame':   93254 obs. of  3 variables:
 $ sampdate: Date, format: "2020-08-26" "2020-08-26" ...
 $ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123 128 
133 138 143 148 ...
 $ cfs : num  176 156 165 161 156 154 144 137 142 142 ...

ls()

[1] "by_month"  "discharge"

by_month

# A tibble: 93,254 × 3
# Groups:   sampdate [322]
   sampdate   samptime   cfs

 1 2020-08-26 09:30  176
 2 2020-08-26 09:35  156
 3 2020-08-26 09:40  165
 4 2020-08-26 09:45  161
 5 2020-08-26 09:50  156
 6 2020-08-26 09:55  154
 7 2020-08-26 10:00  144
 8 2020-08-26 10:05  137
 9 2020-08-26 10:10  142
10 2020-08-26 10:15  142
# … with 93,244 more rows

I don't know why the discharge values are truncated to 3 digits when they're
6 digits in the input data.

Suggested readings appreciated,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-31 Thread Rich Shepard

On Tue, 31 Aug 2021, Richard O'Keefe wrote:


By the time you get the data from the USGS, you are already far past the point
where what the instruments can write is important.


Richard,

The data are important because they show what's happened in that period of
record. Don't physicians take a medical history from patients even though
those data are far past the point they occurred?


agency_cd site_no datetime tz_cd 71932_00060 71932_00060_cd
5s 15s 20d 6s 14n 10s

(I do not know what the last line signifies.)


The numbers represent the space for each fixed-width field.


After using read.delim to read the file

I note that the timestamps are in a single column, formatted like
"2020-08-30 00:15", matching the pattern "%Y-%m-%d %H:%M".

After reading the data into R and using
r$datetime <- as.POSIXct(r$datetime, format="%Y-%m-%d %H:%M",
  tz=r$tz_cd)


And I use emacs to replace the space between columns with commas so the date
and the time are separate.


So for this data set, spanning one year, all the times are in the same time
zone, observations are 15 minutes apart, not 5, and there are no missing
data.  This was obviously the wrong data set.


As I provided when I first asked for suggestions:
sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000

The recorded values are 5 minutes apart.

That data set is immaterial for my project but perfect when one needs data
from that gauge station on the Rogue River.


The flow is dominated by a series of "bursts" with a fast onset to a peak
and a slow decay, coming in a range of sizes from quite small to rather
large, separated by gaps of 4 to 45 days.


And when discharge is controlled by flows through a hydroelectric dam there
is a lot of variability. The pattern is important to fish as well as
regulators.


I'd be looking at
- how do I *detect* these bursts? (detecting a peak isn't too hard,
  but the peak is not the onset)
- how do I *characterise* these bursts?
  (and is the onset rate related to the peak size?)
- what's left after taking the bursts out?
- can I relate these bursts to something going on upstream?


Well, those questions could be appropriate depending on what questions you
need the data to answer.

Environmental data are quite different from experimental, economic,
financial, and public data (e.g., unemployment, housing costs).

There are always multiple ways to address an analytical need. Thank you for
your contributions.

Stay well,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Richard O'Keefe
By the time you get the data from the USGS, you are already far past the point
where what the instruments can write is important.
(Obviously an instrument can be sufficiently broken that it cannot
write anything.)
The data for Rogue River that I just downloaded include this comment:

# Data for the following 1 site(s) are contained in this file
#USGS 04118500 ROGUE RIVER NEAR ROCKFORD, MI
# ---
#
# Data provided for site 04118500
#TS   parameter Description
# 71932   00060 Discharge, cubic feet per second
#
# Data-value qualification codes included in this output:
# A  Approved for publication -- Processing and review completed.
# P  Provisional data subject to revision.
# e  Value has been estimated.
#
agency_cd site_no datetime tz_cd 71932_00060 71932_00060_cd
5s 15s 20d 6s 14n 10s

(I do not know what the last line signifies.)
It is, I think, sufficiently clear that the instrument does not know what
the qualification code is!

After using read.delim to read the file

I note that the timestamps are in a single column, formatted like
"2020-08-30 00:15", matching the pattern "%Y-%m-%d %H:%M".

After reading the data into R and using
r$datetime <- as.POSIXct(r$datetime, format="%Y-%m-%d %H:%M",
   tz=r$tz_cd)
I get
      agency          site                     datetime        tz
 USGS:33550   Min.   :4118500   Min.   :2020-08-30 00:00:00   EST:33550
  1st Qu.:4118500   1st Qu.:2020-11-25 13:33:45
  Median :4118500   Median :2021-03-08 03:52:30
  Mean   :4118500   Mean   :2021-03-01 07:05:54
  3rd Qu.:4118500   3rd Qu.:2021-06-03 12:41:15
  Max.   :4118500   Max.   :2021-08-30 22:00:00
      flow            qual
 Min.   : 96.5   A  :18052
 1st Qu.:156.0   A:e:  757
 Median :193.0   P  :14741
 Mean   :212.5
 3rd Qu.:237.0
 Max.   :767.0

So for this data set, spanning one year, all the times are in the same time
zone, observations are 15 minutes apart, not 5, and there are no missing
data.  This was obviously the wrong data set.
Oh well, picking an epoch such as
> epoch <- min(r$datetime)
and then calculating
as.numeric(difftime(timestamp, epoch, units="min"))
will give you a minute count from which determining day number
and bucket within day is trivial arithmetic.
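That "trivial arithmetic" can be spelled out as follows. This is a sketch with made-up timestamps; the 15-minute bucket width matches the data set described above:

```r
# From a minute offset, integer division gives the day number and the
# 15-minute bucket within the day.
stamps <- as.POSIXct(c("2020-08-30 00:00", "2020-08-30 00:15",
                       "2020-08-31 10:30"), tz = "UTC")
epoch  <- min(stamps)
mins   <- as.numeric(difftime(stamps, epoch, units = "min"))
day    <- mins %/% (24 * 60) + 1          # 1-based day number
bucket <- (mins %% (24 * 60)) %/% 15 + 1  # 1-based 15-minute slot in day
day     # 1 1 2
bucket  # 1 2 43
```

The day vector then serves directly as a grouping column for aggregate().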

I have attached a plot of the Rogue River flows which should make it
very clear what I mean by saying that means and standard deviations
are not a good way to characterise this kind of data.

The flow is dominated by a series of "bursts" with a fast onset to a peak
and a slow decay, coming in a range of sizes from quite small to rather
large, separated by gaps of 4 to 45 days.

I'd be looking at
 - how do I *detect* these bursts? (detecting a peak isn't too hard,
   but the peak is not the onset)
 - how do I *characterise* these bursts?
   (and is the onset rate related to the peak size?)
 - what's left after taking the bursts out?
 - can I relate these bursts to something going on upstream?

My usual recommendation is to start with things available in R out of the
box in order to reduce learning time.

On Tue, 31 Aug 2021 at 11:34, Rich Shepard  wrote:
>
> On Tue, 31 Aug 2021, Richard O'Keefe wrote:
>
> > I made up fake data in order to avoid showing untested code. It's not part
> > of the process I was recommending. I expect data recorded every N minutes
> > to use NA when something is missing, not to simply not be recorded. Well
> > and good, all that means is that reshaping the data is not a trivial call
> > to matrix(). It does not mean that any additional package is needed or
> > appropriate and it does not affect the rest of the process.
>
> Richard,
>
> The instruments in the gauge pipe don't know to write NA when they're not
> measuring. :-) The outage period varies greatly by location, constituent
> measured, and other unknown factors.
>
> > You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
> > the time stamps are in universal time or in local time?
>
> The data values are not timestamps. There's one column for the date, a
> second column for the time, and a third column for the time zone (P in the
> case of the west coast).
>
> > Above all, it doesn't affect the point that you probably should not
> > be doing any of this.
>
> ? (Doesn't require an explanation.)
>
> Rich
>


Rogue River.pdf
Description: Adobe PDF document

Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Bert Gunter
I do not wish to express any opinion on what should be done or how. But...

1. I assume that when data are missing, they are missing -- i.e.
simply not present in the data. So there will be possibly several/many
in succession missing rows of data corresponding to those times,
right? (Apologies for being a bit dumb about this, but I always need
to check that what I think is blindingly obvious really is).

2. Do note that when one takes daily averages/sd's/whatever summaries
of data that, because of missingness, may be calculated from possibly
quite different numbers of data points -- are whole days sometimes
missing?? -- then all the summaries (e.g. means) are not created
equal: summaries created from more data are more "trustworthy" and
should receive "appropriately" greater weight than those created from
fewer. Makes sense, right?
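A minimal way to keep that information visible (a sketch with made-up values, not the thread's actual data) is to carry the per-day observation count alongside each mean:

```r
# Sketch: daily means plus the number of observations behind each one,
# so thinly supported days can be flagged or down-weighted later.
r <- data.frame(
  day = as.Date(c("2020-08-26", "2020-08-26", "2020-08-26", "2020-08-27")),
  cfs = c(136000, 126000, 130000, 128000)
)
daily <- aggregate(list(mean_cfs = r$cfs), by = list(day = r$day), FUN = mean)
daily$n <- as.vector(table(r$day))  # table() sorts by day, matching aggregate()
daily   # 2020-08-27's mean rests on a single observation
```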

So I suspect that this may not be as straightforward as you think --
you may wish to find a local statistician with some experience in
these sorts of things to help you deal with them. Up to you, of
course.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Aug 30, 2021 at 4:34 PM Rich Shepard  wrote:
>
> On Tue, 31 Aug 2021, Richard O'Keefe wrote:
>
> > I made up fake data in order to avoid showing untested code. It's not part
> > of the process I was recommending. I expect data recorded every N minutes
> > to use NA when something is missing, not to simply not be recorded. Well
> > and good, all that means is that reshaping the data is not a trivial call
> > to matrix(). It does not mean that any additional package is needed or
> > appropriate and it does not affect the rest of the process.
>
> Richard,
>
> The instruments in the gauge pipe don't know to write NA when they're not
> measuring. :-) The outage period varies greatly by location, constituent
> measured, and other unknown factors.
>
> > You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
> > the time stamps are in universal time or in local time?
>
> The data values are not timestamps. There's one column for the date, a
> second column for the time, and a third column for the time zone (P in the
> case of the west coast).
>
> > Above all, it doesn't affect the point that you probably should not
> > be doing any of this.
>
> ? (Doesn't require an explanation.)
>
> Rich
>



Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Avi Gross via R-help
Am I seeing an odd aspect to this discussion?

There are many ways to solve problems and some may be favored by some more
than others.

All require some examination of the data so it can be massaged into shape
for the processes that follow.

If you insist on using the matrix method to arrange that each row or column
has the data you want, then, yes, you need to guarantee all your data is
present and in the right order. If some may be missing, you may want to
write a program that generates all possible dates in order and interpolates
them back (or into a copy more likely) so all the missing items are
represented and show up as an NA or whatever you want. You may also want to
check all dates are in order with no duplicates and anything else that makes
sense and then you are free to ask the vector to be seen as a matrix with N
columns or rows.

For many, it is much cleaner to use constructs that may be more resistant
to imperfections or allow them to be treated better. I would
probably use tidyverse functionality these days but can easily understand
people preferring base R or other packages. I have done similar analyses of
real data gathered from streams of various chemicals and levels taken at
various times and depths including times no measures happened and times
there were more than one measure. It is thus much more robust to use methods
like group_by and then apply other such verbs already being done grouped and
especially when the next steps involved making plots with ggplot. It was
rather trivial for example, to replace multiple measures by the average of
the measures. And many of my plots are faceted by variables which is not
trivial to do in base R.

I suggest not falling in love with the first way you think of and try to
bend everything to fit. Yes, some methods may be quite a bit more efficient
but rarely do I run into problems even with quite large collections of data
like a quarter million rows with dozens of columns, including odd columns
like the output of some analysis.

And note the current set of data may be extended with more over time or you
may get other data collected that would not necessarily work well with a
hard-coded method but might easily adjust to a new method. 

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Monday, August 30, 2021 7:34 PM
To: R Project Help 
Subject: Re: [R] Calculate daily means from 5-minute interval data

On Tue, 31 Aug 2021, Richard O'Keefe wrote:

> I made up fake data in order to avoid showing untested code. It's not 
> part of the process I was recommending. I expect data recorded every N 
> minutes to use NA when something is missing, not to simply not be 
> recorded. Well and good, all that means is that reshaping the data is 
> not a trivial call to matrix(). It does not mean that any additional 
> package is needed or appropriate and it does not affect the rest of the
process.

Richard,

The instruments in the gauge pipe don't know to write NA when they're not
measuring. :-) The outage period varies greatly by location, constituent
measured, and other unknown factors.

> You will want the POSIXct class, see ?DateTimeClasses. Do you know 
> whether the time stamps are in universal time or in local time?

The data values are not timestamps. There's one column for the date, a second
column for the time, and a third column for the time zone (P in the case of
the west coast).

> Above all, it doesn't affect the point that you probably should not be 
> doing any of this.

? (Doesn't require an explanation.)

Rich




Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Rich Shepard

On Tue, 31 Aug 2021, Richard O'Keefe wrote:


I made up fake data in order to avoid showing untested code. It's not part
of the process I was recommending. I expect data recorded every N minutes
to use NA when something is missing, not to simply not be recorded. Well
and good, all that means is that reshaping the data is not a trivial call
to matrix(). It does not mean that any additional package is needed or
appropriate and it does not affect the rest of the process.


Richard,

The instruments in the gauge pipe don't know to write NA when they're not
measuring. :-) The outage period varies greatly by location, constituent
measured, and other unknown factors.


You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
the time stamps are in universal time or in local time?


The data values are not timestamps. There's one column for the date, a second
column for the time, and a third column for the time zone (P in the case of
the west coast).


Above all, it doesn't affect the point that you probably should not
be doing any of this.


? (Doesn't require an explanation.)

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Richard O'Keefe
I made up fake data in order to avoid showing untested code.
It's not part of the process I was recommending.
I expect data recorded every N minutes to use NA when something
is missing, not to simply not be recorded.  Well and good, all that
means is that reshaping the data is not a trivial call to matrix().
It does not mean that any additional package is needed or appropriate
and it does not affect the rest of the process.

You will want the POSIXct class, see ?DateTimeClasses.
Do you know whether the time stamps are in universal time or in
local time?

Above all, it doesn't affect the point that you probably should not
be doing any of this.

On Tue, 31 Aug 2021 at 00:42, Rich Shepard  wrote:
>
> On Mon, 30 Aug 2021, Richard O'Keefe wrote:
>
> > Why would you need a package for this?
> >> samples.per.day <- 12*24
> >
> > That's 12 5-minute intervals per hour and 24 hours per day.
> > Generate some fake data.
>
> Richard,
>
> The problem is that there are days with fewer than 12 recorded values for
> various reasons.
>
> When testing algorithms I use small subsets of actual data rather than fake
> data.
>
> Thanks for your detailed procedure.
>
> Regards,
>
> Rich
>



Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Rich Shepard

On Mon, 30 Aug 2021, Richard O'Keefe wrote:


Why would you need a package for this?

samples.per.day <- 12*24


That's 12 5-minute intervals per hour and 24 hours per day.
Generate some fake data.


Richard,

The problem is that there are days with fewer than 12 recorded values for
various reasons.

When testing algorithms I use small subsets of actual data rather than fake
data.

Thanks for your detailed procedure.

Regards,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Richard O'Keefe
It is not clear to me who Jeff Newmiller's comment about periodicity
is addressed to.
The original poster, for asking for daily summaries?
A summary of what I wrote:
- daily means and standard deviations are a very poor choice for river flow data
- if you insist on doing that anyway, no fancy packages are required, just
  reshape the data into a matrix where rows correspond to days using matrix()
  and summarise it using rowMeans() and apply(... FUN=sd).
- but it is quite revealing to just plot the data using image(), which makes no
  assumptions about periodicity or anything else, just a way of wrapping 1D
  data to fill a 2D space and still have interpretable axes
- the river data I examined showed fairly boring time series interrupted by
  substantial shocks (caused by rainfall in catchment areas).
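The reshape-and-summarise bullets above can be sketched as follows, with fake data and assuming a complete, gap-free series (the very assumption the thread warns about):

```r
# Matrix reshape of a regularly sampled series: one row per day.
set.seed(1)                                    # reproducible fake data
samples.per.day <- 4 * 24                      # 96 15-minute samples per day
x <- rnorm(samples.per.day * 30)               # 30 days of fake flows
m <- matrix(x, ncol = samples.per.day, byrow = TRUE)
daily.mean <- rowMeans(m)                      # one mean per day
daily.sd   <- apply(m, 1, FUN = sd)            # one sd per day
image(m)   # x: day, y: time of day; no periodicity model, just wrapping
```

If any day has a missing or extra sample, matrix() silently misaligns the rows, which is why the gap-free assumption must be checked first.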

New stuff...
The river data I looked at came from Environment Canterbury.
River flows there are driven by (a) snow-melt from the Southern Alps.
which *is* roughly periodic with a period of one year and (b) rainfall
events which charge the upstream catchment areas, leading to a
rapid ramp up followed by a slower exponential-looking decay.
The (a) element happens to be invisible in the Environment Canterbury
data, as they only release the latest month of flow data. The ratio
between low flows and high flows ranged from 2 to 10 in the data I
could get.

The (b) component is NOT periodic and is NOT aligned with days
and is NOT predictable and is extremely important.

Where you begin is not with R or a search for packages but with
the question "what is actually going on in the real world?  What
are the influences on river flow, are they natural (and which) or
human (and which)?"  It's going to matter a lot how much
irrigation water is drawn from a river, and that may be roughly
predictable.  If water is occasionally diverted into another river
for flood control, that's going to make a difference.  If there is a
dam, that's going to make a difference.  Rainfall and snowmelt
are going to be seasonal (in a hand-wavy sense) but differently so.

And there is an equally important question:  "Why am I doing this?
What do I want to see in the data that doesn't already leap to the
eye?  What is anyone going to DO differently if they see that?"
Are you interested in whether minimum flows are adequate for
irrigation or whether flood control systems are adequate for high
flows?

Thinking about the people who might read my report, if I were
tasked with analysing river data, I would want to analyse the
data and present the results in such a way that most of them
would say "Why did I need this guy?  It's so obvious!  I could
have done that!  (If I had ever thought of it.)"  But that is because
I am thinking of farmers and politicians who have other maddened
grizzly bears to stun (thanks, Terry Pratchett).  If writing for an
audience of hydrologists and statisticians, you would make
different choices.

Here's a little bit of insight from the physics.
Why is it that the spikes in the flows rise rapidly and fall slowly?
Because the fall is limited by the rate at which the river system
can carry water away, but the rate at which a storm can deliver
water to the river system is not.  Did I know this before looking
at the ECan data?  Well, I had *seen* rivers rising rapidly and
falling slowly, but I had never *observed*; I had never thought
about it.   But now that I have, it's *obvious*: you cannot
understand the river without understanding the weather that
the river is subject to.  Anyone who genuinely understands
hydrology is looking at me sadly and saying "Just now you
figured this out?  At your mother's knee you didn't learn this?"
But it has such repercussions.  It means you need data on
rainfall in the catchment areas.  (Which ECan, to their credit,
also provide.)  In an important sense, there is no right way to
analyse river flow data *on its own*.

On Mon, 30 Aug 2021 at 14:47, Jeff Newmiller  wrote:
>
> IMO assuming periodicity is a bad practice for this. Missing timestamps 
> happen too, and there is no reason to build a broken analysis process.
>
> On August 29, 2021 7:09:01 PM PDT, Richard O'Keefe  wrote:
> >Why would you need a package for this?
> >> samples.per.day <- 12*24
> >
> >That's 12 5-minute intervals per hour and 24 hours per day.
> >Generate some fake data.
> >
> >> x <- rnorm(samples.per.day * 365)
> >> length(x)
> >[1] 105120
> >
> >Reshape the fake data into a matrix where each row represents one
> >24-hour period.
> >
> >> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)
> >
> >Now we can summarise the rows any way we want.
> >The basic tool here is ?apply.
> >?rowMeans is said to be faster than using apply to calculate means,
> >so we'll use that.  There is no *rowSds so we have to use apply
> >for the standard deviation.  I use ?head because I don't want to
> >post tens of thousands of meaningless numbers.
> >
> >> head(rowMeans(m))
> >[1] -0.03510177  0.11817337  0.06725203 -0.03578195 -0.02448077 -0.03033692

Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Jeff Newmiller
IMO assuming periodicity is a bad practice for this. Missing timestamps happen 
too, and there is no reason to build a broken analysis process.
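A minimal base-R sketch of the alternative (the column names date/cfs are assumed from Rich's sample data): keying the summary to the actual date column means a dropped timestamp merely shrinks that day's sample count, instead of silently shifting every later reading into the wrong row the way a fixed-width matrix reshape would.

```r
set.seed(1)
# Three days of fake 5-minute flows, 288 samples per day
df1 <- data.frame(
  date = rep(as.Date("2020-08-26") + 0:2, each = 288),
  cfs  = rnorm(3 * 288, mean = 100000, sd = 15000)
)
df1 <- df1[-sample(nrow(df1), 50), ]  # drop rows to simulate missing samples

# Aggregate keyed on the date itself; gaps cannot misalign the days
daily <- aggregate(cfs ~ date, data = df1,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
daily <- cbind(daily["date"], daily$cfs)  # flatten the matrix column
```

Each row of `daily` now carries its own calendar date plus the sample count actually used, so incomplete days are visible rather than hidden.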

On August 29, 2021 7:09:01 PM PDT, Richard O'Keefe  wrote:
>Why would you need a package for this?
>> samples.per.day <- 12*24
>
>That's 12 5-minute intervals per hour and 24 hours per day.
>Generate some fake data.
>
>> x <- rnorm(samples.per.day * 365)
>> length(x)
>[1] 105120
>
>Reshape the fake data into a matrix where each row represents one
>24-hour period.
>
>> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)
>
>Now we can summarise the rows any way we want.
>The basic tool here is ?apply.
>?rowMeans is said to be faster than using apply to calculate means,
>so we'll use that.  There is no *rowSds so we have to use apply
>for the standard deviation.  I use ?head because I don't want to
>post tens of thousands of meaningless numbers.
>
>> head(rowMeans(m))
>[1] -0.03510177  0.11817337  0.06725203 -0.03578195 -0.02448077 -0.03033692
>> head(apply(m, MARGIN=1, FUN=sd))
>[1] 1.0017718 0.9922920 1.0100550 0.9956810 1.0077477 0.9833144
>
>Now whether this is a *sensible* way to summarise your flow data is a question
>that a hydrologist would be better placed to answer.  I would have started with
>> plot(density(x))
>which I just did with some real river data (only a month of it, sigh).
>Very long tail.
>Even
>> plot(density(log(r)))
>shows a very long tail.  Time to plot the data against time.  Oh my!
>All of the long tail came from a single event.
>There's a period of low flow, then there's a big rainstorm and the
>flow goes WAY up, then over about two days the flow subsides to a new
>somewhat higher level.
>
>None of this is reflected in means or standard deviations.
>This is *time series* data, and time series data of a fairly special kind.
>
>One thing that might be helpful with your data would simply be
>> image(log(m))
>For my one month sample, the spike showed up very clearly that way.
>Because right now, your first task is to get an idea of what the data
>look like, and means-and-standard-deviations won't really do that.
>
>Oh heck, here's another reason to go with image(log(m)).
>With image(m) I just see the one big spike.
>With image(log(m)), I can see that little spikes often start in the
>afternoon of one day and continue into the morning of the next.
>From daily means, it looks like two unusual, but not very
>unusual, days.  From the image, it's clearly ONE rainfall event
>that just happens to straddle a day boundary.
>
>This is all very basic stuff, which is really the point.  You want to use
>elementary tools to look at the data before you reach for fancy ones.
>
>
>On Mon, 30 Aug 2021 at 03:09, Rich Shepard  wrote:
>>
>> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
>> from a USGS monitoring gauge recording values every 5 minutes. The data
>> files contain 90K-93K lines and plotting all these data would produce a
>> solid block of color.
>>
>> What I want are the daily means and standard deviation from these data.
>>
>> As an occasional R user (depending on project needs) I've no idea what
>> packages could be applied to these data frames. There likely are multiple
>> paths to extracting these daily values so summary statistics can be
>> calculated and plotted. I'd appreciate suggestions on where to start to
>> learn how I can do this.
>>
>> TIA,
>>
>> Rich
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Sent from my phone. Please excuse my brevity.



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Richard O'Keefe
Why would you need a package for this?
> samples.per.day <- 12*24

That's 12 5-minute intervals per hour and 24 hours per day.
Generate some fake data.

> x <- rnorm(samples.per.day * 365)
> length(x)
[1] 105120

Reshape the fake data into a matrix where each row represents one
24-hour period.

> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)

Now we can summarise the rows any way we want.
The basic tool here is ?apply.
?rowMeans is said to be faster than using apply to calculate means,
so we'll use that.  There is no *rowSds so we have to use apply
for the standard deviation.  I use ?head because I don't want to
post tens of thousands of meaningless numbers.

> head(rowMeans(m))
[1] -0.03510177  0.11817337  0.06725203 -0.03578195 -0.02448077 -0.03033692
> head(apply(m, MARGIN=1, FUN=sd))
[1] 1.0017718 0.9922920 1.0100550 0.9956810 1.0077477 0.9833144

Now whether this is a *sensible* way to summarise your flow data is a question
that a hydrologist would be better placed to answer.  I would have started with
> plot(density(x))
which I just did with some real river data (only a month of it, sigh).
Very long tail.
Even
> plot(density(log(r)))
shows a very long tail.  Time to plot the data against time.  Oh my!
All of the long tail came from a single event.
There's a period of low flow, then there's a big rainstorm and the
flow goes WAY up, then over about two days the flow subsides to a new
somewhat higher level.

None of this is reflected in means or standard deviations.
This is *time series* data, and time series data of a fairly special kind.

One thing that might be helpful with your data would simply be
> image(log(m))
For my one month sample, the spike showed up very clearly that way.
Because right now, your first task is to get an idea of what the data
look like, and means-and-standard-deviations won't really do that.

Oh heck, here's another reason to go with image(log(m)).
With image(m) I just see the one big spike.
With image(log(m)), I can see that little spikes often start in the
afternoon of one day and continue into the morning of the next.
From daily means, it looks like two unusual, but not very
unusual, days.  From the image, it's clearly ONE rainfall event
that just happens to straddle a day boundary.

This is all very basic stuff, which is really the point.  You want to use
elementary tools to look at the data before you reach for fancy ones.
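A sketch of that image(log(m)) view on fake spiky flow data (assumptions: 288 samples per day, one injected storm event straddling a day boundary; the decay constant is made up for illustration):

```r
set.seed(42)
samples.per.day <- 12 * 24
x <- rexp(samples.per.day * 30, rate = 1/100)  # 30 days of baseline flow

# Inject one storm starting mid-afternoon of day 16: rapid rise, slow decay
spike <- seq(from = 15.5 * samples.per.day, length.out = samples.per.day)
x[spike] <- x[spike] + 5000 * exp(-seq_along(spike) / 100)

m <- matrix(x, ncol = samples.per.day, byrow = TRUE)
image(log(m))  # rows = days, columns = time of day; the event spans two rows
```

On the linear scale the storm drowns out everything else; on the log scale the baseline texture and the day-straddling event are both visible at once.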


On Mon, 30 Aug 2021 at 03:09, Rich Shepard  wrote:
>
> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> from a USGS monitoring gauge recording values every 5 minutes. The data
> files contain 90K-93K lines and plotting all these data would produce a
> solid block of color.
>
> What I want are the daily means and standard deviation from these data.
>
> As an occasional R user (depending on project needs) I've no idea what
> packages could be applied to these data frames. There likely are multiple
> paths to extracting these daily values so summary statistics can be
> calculated and plotted. I'd appreciate suggestions on where to start to
> learn how I can do this.
>
> TIA,
>
> Rich
>



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rich Shepard

On Sun, 29 Aug 2021, Andrew Simmons wrote:


I would suggest something like:


Thanks, Andrew.

Stay well,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rich Shepard

On Sun, 29 Aug 2021, Rui Barradas wrote:


Hope this helps,


Rui,

Greatly! I'll study it carefully so I fully understand the process.

Many thanks.

Stay well,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rui Barradas

Hello,

I forgot in my previous answer, sorry for the duplicated mails.

The function in my previous mail has a na.rm argument, defaulting to 
FALSE, pass na.rm = TRUE to remove the NA's.



agg <- aggregate(cfs ~ date, df1, fun, na.rm = TRUE)



Or simply change the default. I prefer to keep na.rm = FALSE because 
that is what R's own functions do, and this way I only have to remember 
one default, whether I am using base R functions or my own code.



Hope this helps,

Rui Barradas

Às 17:52 de 29/08/21, Rich Shepard escreveu:

On Sun, 29 Aug 2021, Jeff Newmiller wrote:

The general idea is to create a "grouping" column with repeated values for
each day, and then to use aggregate to compute your combined results. The
dplyr package's group_by/summarise functions can also do this, and there
are also proponents of the data.table package which is high performance
but tends to depend on altering data in-place unlike most other R data
handling functions.

Also pay attention to missing data... if you have any then you will need
to consider whether you want the strictness of na.rm=FALSE or
permissiveness of na.rm=TRUE for your aggregation functions.


Jeff,

Thank you. Yes, there are missing data as sometimes the equipment fails, or
there's some other reason why some samples are missing.

Grouping on each day is just what I need. I'll re-learn dplyr and take a
look at data.table.

Regards,

Rich





Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rich Shepard

On Sun, 29 Aug 2021, Jeff Newmiller wrote:


You may find something useful on handling timestamp data here:
https://jdnewmil.github.io/


Jeff,

I'll certainly read those articles.

Many thanks,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rui Barradas

Hello,

You have date and hour in two separate columns, so to compute daily 
stats part of the work is already done. (Were they in the same column 
you would have to extract the date only.)


# convert to class "Date"
df1$date <- as.Date(df1$date)


# function to compute the stats required
# it's important to note that all the stats
# are returned in a vector, see below
fun <- function(x, na.rm = FALSE){
  c(mean_cfs = mean(x, na.rm = na.rm),
sd_cfs = sd(x, na.rm = na.rm))
}

# now this will put a *matrix* under cfs
# each row has the  statistics computed
# by the function
agg <- aggregate(cfs ~ date, df1, fun)
str(agg)
#'data.frame':  1 obs. of  2 variables:
# $ date: Date, format: "2020-08-26"
# $ cfs : num [1, 1:2] 110400 16143
#  ..- attr(*, "dimnames")=List of 2
#  .. ..$ : NULL
#  .. ..$ : chr [1:2] "mean_cfs" "sd_cfs"


# so now put everything in separate columns
agg <- cbind(agg[-ncol(agg)], agg[[ncol(agg)]])
str(agg)
#'data.frame':  1 obs. of  3 variables:
# $ date: Date, format: "2020-08-26"
# $ mean_cfs: num 110400
# $ sd_cfs  : num 16143


Hope this helps,

Rui Barradas

Às 17:49 de 29/08/21, Rich Shepard escreveu:

On Sun, 29 Aug 2021, Eric Berger wrote:

Provide dummy data (e.g. 5-10 lines), say like the contents of a csv file,
and calculate by hand what you'd like to see in the plot. (And describe
what the plot would look like.)


Eric,

Mea culpa! I extracted a set of sample data and forgot to include it in the
message. Here it is:

date,time,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,12
...
2020-08-26,23:10,108000
2020-08-26,23:15,96200
2020-08-26,23:20,86700
2020-08-26,23:25,103000
2020-08-26,23:30,103000
2020-08-26,23:35,99500
2020-08-26,23:40,85200
2020-08-26,23:45,103000
2020-08-26,23:50,95800
2020-08-26,23:55,88200

Rich





Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Andrew Simmons
Hello,


I would suggest something like:


date <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), 1)
time <- sprintf("%02d:%02d", rep(0:23, each = 12), seq.int(0, 55, 5))
x <- data.frame(
date = rep(date, each = length(time)),
time = time
)
x$cfs <- stats::rnorm(nrow(x))


cols2aggregate <- "cfs"  # add more as necessary


S <- split(x[cols2aggregate], x$date)


means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
sds   <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm = TRUE)))

On Sun, Aug 29, 2021 at 11:09 AM Rich Shepard 
wrote:

> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> from a USGS monitoring gauge recording values every 5 minutes. The data
> files contain 90K-93K lines and plotting all these data would produce a
> solid block of color.
>
> What I want are the daily means and standard deviation from these data.
>
> As an occasional R user (depending on project needs) I've no idea what
> packages could be applied to these data frames. There likely are multiple
> paths to extracting these daily values so summary statistics can be
> calculated and plotted. I'd appreciate suggestions on where to start to
> learn how I can do this.
>
> TIA,
>
> Rich
>
>




Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rich Shepard

On Sun, 29 Aug 2021, Rui Barradas wrote:


I forgot in my previous answer, sorry for the duplicated mails.
The function in my previous mail has a na.rm argument, defaulting to FALSE, 
pass na.rm = TRUE to remove the NA's.


agg <- aggregate(cfs ~ date, df1, fun, na.rm = TRUE)

Or simply change the default. I prefer to keep na.rm = FALSE because that is 
what R's own functions do, and this way I only have to remember one default, 
whether I am using base R functions or my own code.



Hope this helps,


Rui,

Again, yes it does.

Regards,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Jeff Newmiller
You may find something useful on handling timestamp data here: 
https://jdnewmil.github.io/

On August 29, 2021 9:23:31 AM PDT, Jeff Newmiller  
wrote:
>The general idea is to create a "grouping" column with repeated values for 
>each day, and then to use aggregate to compute your combined results. The 
>dplyr package's group_by/summarise functions can also do this, and there are 
>also proponents of the data.table package which is high performance but tends 
>to depend on altering data in-place unlike most other R data handling 
>functions.
>
>Also pay attention to missing data... if you have any then you will need to 
>consider whether you want the strictness of na.rm=FALSE or permissiveness of 
>na.rm=TRUE for your aggregation functions.
>
>On August 29, 2021 8:08:58 AM PDT, Rich Shepard  
>wrote:
>>I have a year's hydraulic data (discharge, stage height, velocity, etc.)
>>from a USGS monitoring gauge recording values every 5 minutes. The data
>>files contain 90K-93K lines and plotting all these data would produce a
>>solid block of color.
>>
>>What I want are the daily means and standard deviation from these data.
>>
>>As an occasional R user (depending on project needs) I've no idea what
>>packages could be applied to these data frames. There likely are multiple
>>paths to extracting these daily values so summary statistics can be
>>calculated and plotted. I'd appreciate suggestions on where to start to
>>learn how I can do this.
>>
>>TIA,
>>
>>Rich
>>
>




Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rich Shepard

On Sun, 29 Aug 2021, Jeff Newmiller wrote:


The general idea is to create a "grouping" column with repeated values for
each day, and then to use aggregate to compute your combined results. The
dplyr package's group_by/summarise functions can also do this, and there
are also proponents of the data.table package which is high performance
but tends to depend on altering data in-place unlike most other R data
handling functions.

Also pay attention to missing data... if you have any then you will need
to consider whether you want the strictness of na.rm=FALSE or
permissiveness of na.rm=TRUE for your aggregation functions.


Jeff,

Thank you. Yes, there are missing data as sometimes the equipment fails, or
there's some other reason why some samples are missing.

Grouping on each day is just what I need. I'll re-learn dplyr and take a
look at data.table.

Regards,

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rich Shepard

On Sun, 29 Aug 2021, Eric Berger wrote:


Provide dummy data (e.g. 5-10 lines), say like the contents of a csv file,
and calculate by hand what you'd like to see in the plot. (And describe
what the plot would look like.)


Eric,

Mea culpa! I extracted a set of sample data and forgot to include it in the
message. Here it is:

date,time,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,13
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,12
...
2020-08-26,23:10,108000
2020-08-26,23:15,96200
2020-08-26,23:20,86700
2020-08-26,23:25,103000
2020-08-26,23:30,103000
2020-08-26,23:35,99500
2020-08-26,23:40,85200
2020-08-26,23:45,103000
2020-08-26,23:50,95800
2020-08-26,23:55,88200

Rich



Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Jeff Newmiller
The general idea is to create a "grouping" column with repeated values for each 
day, and then to use aggregate to compute your combined results. The dplyr 
package's group_by/summarise functions can also do this, and there are also 
proponents of the data.table package which is high performance but tends to 
depend on altering data in-place unlike most other R data handling functions.

Also pay attention to missing data... if you have any then you will need to 
consider whether you want the strictness of na.rm=FALSE or permissiveness of 
na.rm=TRUE for your aggregation functions.
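A short sketch of both points (the single POSIXct column `ts` is hypothetical; Rich's file has separate date and time columns, which are even easier): derive the grouping column from the timestamp, then compare strict and permissive handling of gauge dropouts.

```r
set.seed(7)
# Two days of fake 5-minute readings with a single timestamp column
x <- data.frame(ts = seq(as.POSIXct("2020-08-26 00:00", tz = "UTC"),
                         by = "5 min", length.out = 2 * 288))
x$cfs <- rnorm(nrow(x), mean = 100000, sd = 15000)
x$cfs[c(10, 300)] <- NA              # simulated equipment failures

x$day <- as.Date(x$ts, tz = "UTC")   # the "grouping" column

strict     <- tapply(x$cfs, x$day, mean)               # any NA poisons the day
permissive <- tapply(x$cfs, x$day, mean, na.rm = TRUE) # NAs silently dropped
```

The strict version flags incomplete days by returning NA; the permissive one averages whatever remains. Which you want depends on whether a partially-observed day should count as a valid daily mean.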

On August 29, 2021 8:08:58 AM PDT, Rich Shepard  
wrote:
>I have a year's hydraulic data (discharge, stage height, velocity, etc.)
>from a USGS monitoring gauge recording values every 5 minutes. The data
>files contain 90K-93K lines and plotting all these data would produce a
>solid block of color.
>
>What I want are the daily means and standard deviation from these data.
>
>As an occasional R user (depending on project needs) I've no idea what
>packages could be applied to these data frames. There likely are multiple
>paths to extracting these daily values so summary statistics can be
>calculated and plotted. I'd appreciate suggestions on where to start to
>learn how I can do this.
>
>TIA,
>
>Rich
>




Re: [R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Eric Berger
Hi Rich,
Your request is a bit open-ended but here's a suggestion that might help
get you an answer.
Provide dummy data (e.g. 5-10 lines), say like the contents of a csv file,
and calculate by hand what you'd like to see in the plot. (And describe
what the plot would look like.)
It sounds like what you want could be done in a few lines of R code which
would work both on the dummy
data and the real data.

HTH,
Eric


On Sun, Aug 29, 2021 at 6:09 PM Rich Shepard 
wrote:

> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> from a USGS monitoring gauge recording values every 5 minutes. The data
> files contain 90K-93K lines and plotting all these data would produce a
> solid block of color.
>
> What I want are the daily means and standard deviation from these data.
>
> As an occasional R user (depending on project needs) I've no idea what
> packages could be applied to these data frames. There likely are multiple
> paths to extracting these daily values so summary statistics can be
> calculated and plotted. I'd appreciate suggestions on where to start to
> learn how I can do this.
>
> TIA,
>
> Rich
>
>




[R] Calculate daily means from 5-minute interval data

2021-08-29 Thread Rich Shepard

I have a year's hydraulic data (discharge, stage height, velocity, etc.)
from a USGS monitoring gauge recording values every 5 minutes. The data
files contain 90K-93K lines and plotting all these data would produce a
solid block of color.

What I want are the daily means and standard deviation from these data.

As an occasional R user (depending on project needs) I've no idea what
packages could be applied to these data frames. There likely are multiple
paths to extracting these daily values so summary statistics can be
calculated and plotted. I'd appreciate suggestions on where to start to
learn how I can do this.

TIA,

Rich
