Re: [R] Working With Variables Having Different Lengths

David Winsemius Fri, 21 Oct 2011 16:48:16 -0700


On Oct 21, 2011, at 6:17 PM, Rich Shepard wrote:

On Fri, 21 Oct 2011, David Winsemius wrote:
What problem are you trying to solve?
What I need now is to compare TDS (total dissolved solids) with specific conductivity and the ions that normally comprise TDS. Before running any regression models I need to look at these data from three points of view: all data from all sites within each hydrographic drainage basin collected during the past 30 years; average (or total) concentrations (not yet decided on what makes the most ecological sense) within a stream having multiple collection sites; and by site within certain streams.

 Here is the data frame structure:

 str(chemdata)
'data.frame':   47244 obs. of  6 variables:
 $ site    : Factor w/ 143 levels "BC-0.5","BC-1",..: 134 134 134 127 127
 $ sampdate: Date, format: "2006-12-06" "2006-12-06" ...
 $ param   : Factor w/ 66 levels "AGP","ANP","ANP/AGP",..: 58 66 12 24 59 66
 $ quant   : num  1.08e+04 7.95 1.80e-02 2.80e+02 1.90e+01 8.44 1.62e+03
 $ stream  : Factor w/ 24 levels "BCrk","CCrk",..: 4 4 4 21 21 21 4
 $ basin   : Factor w/ 2 levels "BasinEast","BasinWest": 1 1 1 1 1 1 1 1 1 2 ...

The only variable in that dataframe with what appears to be a continuous value (which is how I would expect "total dissolved solids" to be measured) is "quant". Are you saying that the value of quant is measuring something with different units depending on the value of 'param', and that 'site' and 'sampdate' should be used to identify associated measurements? This would appear to be the case based on what you are saying below.

If this is so, the problem is to break apart the dataframe by type of measurement ('param'). One way would be to split it into separate dataframes, then merge them back together by an appropriate linkage on site and date. I'm guessing that 'stream' and 'basin' are superfluous for the matching and can be associated with 'site' later?
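A minimal sketch of that split-and-merge idea in base R. The toy data frame `chem` and its values are made up for illustration (they are not taken from chemdata), and only three of the seven parameters are used to keep it short:

```r
# Toy long-format data; all values are invented for illustration only.
chem <- data.frame(
  site     = c("A", "A", "A", "B", "B"),
  sampdate = as.Date(c("2006-12-06", "2006-12-06", "2006-12-06",
                       "2007-01-10", "2007-01-10")),
  param    = c("TDS", "Cond", "Ca", "TDS", "Cond"),
  quant    = c(280, 410, 35, 190, 300)
)

keep <- c("TDS", "Cond", "Ca")

# One small data frame per parameter, with 'quant' renamed to the parameter.
pieces <- lapply(keep, function(p) {
  d <- chem[chem$param == p, c("site", "sampdate", "quant")]
  names(d)[3] <- p
  d
})

# Merge pairwise on the identifiers; all = TRUE keeps unmatched rows as NA.
wide <- Reduce(
  function(x, y) merge(x, y, by = c("site", "sampdate"), all = TRUE),
  pieces
)
print(wide)
```

Using all = TRUE keeps every site/date combination, so nothing is dropped before you have looked at the pattern of missing values.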

The goal would be a dataframe with 7 renamed 'param' columns ('TDS', 'Cond', 'Mg', 'SO4', 'Cl', 'Na', and 'Ca') and two identifier columns ('site' and 'sampdate'). For the moment I would think you would want all the data together and not make any decisions about excluding NA values until you get an overall picture of the situation.


The first thing I would try would be

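# keep the first four columns (site, sampdate, param, quant) for the 7 parameters
with(subset(chemdata, param %in% c('TDS', 'Cond', 'Mg', 'SO4', 'Cl', 'Na', 'Ca'), 1:4),
     # drop.unused.levels removes the other 59 (now empty) levels of 'param'
     xtabs(quant ~ site + sampdate + param, drop.unused.levels = TRUE))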

You would get 7 tables, one for each 'param', with up to 143 rows and as many columns as you have sampdates.
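A small sketch of what the xtabs() result looks like, on invented data (the `chem` frame below is a stand-in, not chemdata). Two things are worth knowing before relying on it: combinations that were never sampled show up as 0 rather than NA, and repeated site/date/param rows are summed.

```r
# Toy data; values are invented. 'pH' stands in for an unwanted parameter.
chem <- data.frame(
  site     = factor(c("A", "A", "B", "B", "B")),
  sampdate = as.Date(c("2006-12-06", "2006-12-06", "2006-12-06",
                       "2007-01-10", "2007-01-10")),
  param    = factor(c("TDS", "Cond", "TDS", "TDS", "pH")),
  quant    = c(280, 410, 190, 200, 7.9)
)

sel <- subset(chem, param %in% c("TDS", "Cond"))

# Without drop.unused.levels = TRUE the empty 'pH' level would get its own slice.
tab <- xtabs(quant ~ site + sampdate + param, data = sel,
             drop.unused.levels = TRUE)

dim(tab)                        # sites x dates x params
tab["B", "2006-12-06", "TDS"]   # a measured value
tab["A", "2007-01-10", "TDS"]   # never sampled: reported as 0, not NA
```

Because of the zeros, a 0 in the array cannot be told apart from "not measured", which matters if the result feeds a regression.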

This might be a good use for package reshape2, since it generally returns a dataframe. The above operation would return an array with 3 dimensions. You might get immediate success with something like:

dcast(subset(chemdata, param %in% c('TDS', 'Cond', 'Mg', 'SO4', 'Cl', 'Na', 'Ca'), 1:4),
      site + sampdate ~ param)
# the omitted variable name ('quant') should end up as the values in the new columns
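A hedged sketch of the same call on invented data, assuming reshape2 is installed. Here value.var is named explicitly instead of letting dcast() guess it; the `chem` frame and its values are made up for illustration:

```r
library(reshape2)  # assumed to be installed

# Toy long-format data; values are invented for illustration only.
chem <- data.frame(
  site     = c("A", "A", "B", "B"),
  sampdate = as.Date(c("2006-12-06", "2006-12-06",
                       "2007-01-10", "2007-01-10")),
  param    = c("TDS", "Cond", "TDS", "Ca"),
  quant    = c(280, 410, 190, 35)
)

# One row per site/date, one column per parameter; unmeasured cells become NA.
wide <- dcast(chem, site + sampdate ~ param, value.var = "quant")
print(wide)
```

If any site/date/param combination occurs more than once, dcast() falls back to counting rows (it reports that it is defaulting to length), so in that case you would want to pass something like fun.aggregate = mean.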

To do your testing it might be wise to apply more selective use of subset. Perhaps only go for a few sites and dates.


--
david.

While all the data sets used in the books I've read are simpler and well illustrate the analyses presented, what I've not read is guidance on how complex data sets could (or should) be partitioned into smaller but still related data sets to facilitate analyses. Or, how I extract the relevant rows and columns for specific analyses.
That seems very unlikely. What we need is a clearer description of the values that your "param" variable can assume, and what you want to do within categories of those values. We also need you to stop dropping context.
There are 66 different chemicals in the param factor. However, for the immediate effort, only 7 are needed. They are coded 'TDS', 'Cond', 'Mg', 'SO4', 'Cl', 'Na', and 'Ca'.
From the database table I know the number of non-NULL (non-NA) rows for each parameter:

        TDS     2181
        Cond     820
        Mg      1120
        SO4     1980
        Cl      1971
        Na       866
        Ca      1110
Not all were required to be measured at all sites from the beginning in 1981. I do not yet know how many rows have non-NULL values for the 6 pairs compared with TDS.

 If there's more information to provide I'll gladly do so.

Thanks,

Rich

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT
