Preliminary response:
On 22 Nov 1999 [EMAIL PROTECTED] wrote:
> I am looking for a way to characterize a set of data - each set consists
> of many thousands of data points spanning a wide range over three
> sometimes four orders of magnitudes.
This leads one to wonder whether the original variables, or their
logarithms, would be the more appropriate metric for descriptive purposes.
What do the distributions look like?
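
One rough way to look into this is to compare a shape statistic, such as
skewness, on the raw scale and on the log scale. The sketch below does
that in Python. The data are simulated (lognormal) purely for
illustration, and the helper name skewness() is my own; substitute one of
your actual data sets for `raw`.

```python
import math
import random
import statistics

def skewness(values):
    """Third standardized moment (population form); 0 for a symmetric sample."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical data: positive values spread over several orders of magnitude.
rng = random.Random(0)
raw = [rng.lognormvariate(0.0, 2.0) for _ in range(3000)]
logged = [math.log10(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(f"skewness, raw scale:   {raw_skew:.2f}")
print(f"skewness, log10 scale: {log_skew:.2f}")
```

If the skewness is large on the raw scale and near zero after taking
logs, that is some evidence the logarithms are the more natural metric. A
histogram of each would tell you more than any single number.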
> This analysis will then be used by others to examine their own
> experiments - will allow them to compare their results with ours - Are
> they similar? Different? with a confidence level of 95%
> Excuse my ignorance - I have calculated many Student's T Test,
> p values - but have never handled so much data as I am now!
>
> Assume for the purpose of this discussion that we have a data set
> defined by X(I,J) where I = 1,3000 and J=1,10
>
> (J will therefore be the different data sets, I will refer to
> points within each data set)
>
> Calculate the mean and the standard deviation at each point -
> i.e. calculate average and standard deviation at each I for
> J from 1 to 10
This would imply that the only characteristics of interest are
the mean and s.d. Is that really so? If the 10 distributions are all
(approximately) Gaussian (aka "normal"), these are all you need; but if
they are not, rather more descriptive information is probably needed.
As remarked above, the fact that your range extends over several orders
of magnitude suggests that the distributions probably are NOT
Gaussian.
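
For concreteness, here is a sketch of the per-point summary described in
the question, with the median and quartiles added as one example of
"more descriptive information". The array x is a simulated stand-in for
X(I,J); the variable names are my own, not anything from your setup.

```python
import random
import statistics

N_POINTS, N_SETS = 3000, 10   # I = 1..3000, J = 1..10, as in the question

rng = random.Random(1)
# Hypothetical stand-in for X(I,J): x[j][i] is point i of data set j.
x = [[rng.gauss(100.0, 5.0) for _ in range(N_POINTS)] for _ in range(N_SETS)]

# Transpose so that each entry holds the 10 values observed at one point i.
cols = list(zip(*x))

point_mean = [statistics.fmean(c) for c in cols]
point_sd = [statistics.stdev(c) for c in cols]          # n-1 denominator
# Three cut points per point i: lower quartile, median, upper quartile.
point_quartiles = [statistics.quantiles(c, n=4) for c in cols]

print(point_mean[0], point_sd[0], point_quartiles[0])
```

With only 10 values at each point, the quartiles are crude, but a median
that sits well away from the mean is a warning that mean and s.d. alone
are not telling the whole story.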
> Now if I (or someone else) do an eleventh experiment, i.e. J = 11, how
> will I know with some confidence that this 11th experiment is "similar"
> to the 1st 10 experiments? Is it similar (95 %) if the value at each I
> falls within 2 standard deviations of the mean for that I? (I am
> making the assumption that the errors at each point are truly random
> i.e. normally distributed)
I don't see how one could tell in advance; so I'll rephrase the question
to "how can I tell whether this 11th experiment was 'similar'...?"
Sounds to me as though you'd want to test the formal hypothesis that the
11th mean is equal to the mean of the previous 10 experiments.
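
One way such a test might be set up, assuming each experiment is reduced
to a single mean and that those means are roughly Gaussian: treat the 10
earlier means as a sample and ask whether the 11th lies within what that
sample predicts for one further observation. This is my own
simplification, not the only reasonable formulation. The function name
and the example numbers are hypothetical; 2.262 is the tabled two-sided
5% critical value of Student's t on 9 degrees of freedom.

```python
import math
import statistics

T_CRIT_975_DF9 = 2.262   # two-sided 5% point of Student's t, 9 d.f. (n = 10 only)

def compare_new_mean(prev_means, new_mean, t_crit=T_CRIT_975_DF9):
    """Return (t, consistent) for one new mean against the earlier means.

    The factor sqrt(1 + 1/n) allows for uncertainty in both the new
    value and the estimated mean of the earlier experiments.
    """
    n = len(prev_means)
    centre = statistics.fmean(prev_means)
    sd = statistics.stdev(prev_means)            # n-1 denominator
    t = (new_mean - centre) / (sd * math.sqrt(1 + 1 / n))
    return t, abs(t) <= t_crit

# Hypothetical means of the first 10 experiments.
prev = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.1, 10.0]

for candidate in (10.3, 11.0):
    t_stat, consistent = compare_new_mean(prev, candidate)
    print(f"11th mean {candidate}: t = {t_stat:.2f}, consistent = {consistent}")
```

Note that the critical value must change if the number of earlier
experiments is not 10, and that the same test could be run on the
logarithms if those turn out to be the better metric.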
On reflection, I see that I've been assuming that each of your 10 data
sets contains univariate values whose distribution is of interest; but
your description is also consistent with having multivariate values, or
(what is not quite the same thing) having a clutch of subsets, each of
which is of interest for its own distribution (or parameters thereof).
If those several orders of magnitude arise from several systematic
differences within each data set, then a less simple-minded approach than
what I've outlined above would surely be called for.
(In particular, if you're dealing with some industrial chemical process,
there may be different temporal regimes to be dealt with: startup,
coming to equilibrium, equilibrium, and shutdown, to take a (surely
over-simple) example.)
-- DFB.
------------------------------------------------------------------------
Donald F. Burrill [EMAIL PROTECTED]
348 Hyde Hall, Plymouth State College, [EMAIL PROTECTED]
MSC #29, Plymouth, NH 03264 603-535-2597
184 Nashua Road, Bedford, NH 03110 603-471-7128