On Tue, 23 Nov 1999, Donald F. Burrill wrote:
> Preliminary response:
> On 22 Nov 1999 [EMAIL PROTECTED] wrote:
> > I am looking for a way to characterize a set of data - each set consists
> > of many thousands of data points spanning a wide range, over three and
> > sometimes four orders of magnitude.
> This leads one to wonder whether the original variables, or their
> logarithms, would be the more appropriate metric for descriptive purposes.
> What do the distributions look like?
I will try to answer this.
A typical data set is such that if the maximum value is 30,000,
most (90 to 95%) of the values are less than 10 percent of this maximum,
i.e. most values are "small". (I am looking at cDNA microarray
data. These data tell us the relative levels of different
messenger RNAs in cells, so while a few mRNAs are present
at high levels, most of the messages are at low levels.)
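
A minimal sketch of how the shape described above could be checked numerically, using only the Python standard library. The function names `fraction_below` and `orders_of_magnitude` are hypothetical helpers, and the intensities are synthetic log-normal numbers invented for illustration; they are not real microarray data.

```python
import math
import random

def fraction_below(values, fraction_of_max=0.10):
    """Share of values smaller than fraction_of_max times the largest value."""
    threshold = fraction_of_max * max(values)
    return sum(1 for v in values if v < threshold) / len(values)

def orders_of_magnitude(values):
    """log10 of max/min; requires strictly positive values."""
    return math.log10(max(values) / min(values))

rng = random.Random(1)
# Synthetic stand-in data (log-normal), NOT real microarray intensities.
intensities = [rng.lognormvariate(5.0, 1.5) for _ in range(5000)]

print(f"share below 10% of max: {fraction_below(intensities):.1%}")
print(f"orders of magnitude spanned: {orders_of_magnitude(intensities):.1f}")
```

Running the same two functions on each real data set would give a quick, comparable summary of how strongly the values pile up near the low end.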
> This would imply that the only characteristics of interest are
> the mean and s.d. Is that really so? (If the 10 distributions are all
> (approximately) Gaussian (aka "normal"), these are all you need; but if
> they are not, rather more descriptive information is probably needed.)
> As remarked above, the fact that your range extends over several orders
> of magnitude seems to suggest that the distributions probably are NOT
> Gaussian.
I am sure there is a straightforward way to see whether a given
distribution behaves "normally"? I know the equation for the
normal distribution, i.e. given the mean and sigma I can draw
the bell curve. I am trying to understand something called
the Q-Q plot as a way to check for "normality".
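
As a rough sketch of what a normal Q-Q plot does: sort the data, pair each sorted value with the standard-normal quantile at the same rank, and see whether the pairs fall on a straight line. The code below builds those pairs with the standard library and uses their correlation as a crude straightness score. The names `qq_points` and `straightness`, the plotting positions (i - 0.5)/n (one common convention among several), and the synthetic log-normal data are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def qq_points(values):
    """Return (theoretical normal quantile, observed value) pairs."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i - 0.5)/n keeps every probability strictly in (0, 1).
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]

def straightness(points):
    """Pearson correlation of the Q-Q pairs; near 1 means nearly a line."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in points)
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    var_x = sum((x - mean_x) ** 2 for _, x in points)
    return cov / math.sqrt(var_t * var_x)

rng = random.Random(0)
# Synthetic, made-up intensities (log-normal), NOT real microarray data.
raw = [rng.lognormvariate(5.0, 1.5) for _ in range(5000)]
logged = [math.log10(v) for v in raw]

r_raw = straightness(qq_points(raw))
r_log = straightness(qq_points(logged))
print(f"raw: r = {r_raw:.3f}   log10: r = {r_log:.3f}")
```

Plotting the pairs (theoretical quantile on one axis, observed value on the other) gives the usual picture; a curve that bends sharply upward at the right suggests a long upper tail, which is where comparing the raw values with their logarithms becomes informative.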
> On reflection, I see that I've been assuming that each of your 10 data
> sets contains univariate values whose distribution is of interest; but
> your description is also consistent with having multivariate values, or
> (what is not quite the same thing) having a clutch of subsets, each of
> which is of interest for its own distribution (or parameters thereof).
> If those several orders of magnitude arise from several systematic
> differences within each data set, then a less simple-minded approach than
> what I've outlined above would surely be called for.
>
The difference in magnitudes of the mRNA levels _may be_ systematic.
Let me explain a bit more here.
A given experiment consists of approximately 5000 data points of
different intensities. Intensities range from background values
of between 7 and 14 up to about 30,000 or so. The intensities are
proportional to the amount of messenger RNA (mostly; there are
exceptions). These 5000 or so points are obtained by
throwing a bunch of messenger RNA against a membrane that
contains 5000 or so genes. Replicate experiments are done with
different membranes and different mRNA preparations. If we look at
a given spot on the membrane, the intensity varies from
experiment to experiment. The question is: how do we
characterize these variations? They can be due
to different membrane lots, different RNAs, different people doing
the experiment (!), etc.
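
One possible starting point, sketched under assumptions rather than offered as an established method: take log10 of each intensity (the logarithmic metric raised in the quoted reply) and, for each spot, compute the mean and standard deviation across replicate experiments. The function name `per_spot_log_spread` and the toy numbers are invented for illustration, and the sketch assumes every replicate lists the same spots in the same order with strictly positive intensities.

```python
import math
from statistics import mean, stdev

def per_spot_log_spread(replicates):
    """For each spot, return (mean, sd) of log10 intensity across replicates.

    replicates: list of experiments, each a list of positive intensities,
    with spots in the same order in every experiment (at least two
    experiments are needed for a standard deviation).
    """
    results = []
    for spot_values in zip(*replicates):
        logs = [math.log10(v) for v in spot_values]
        results.append((mean(logs), stdev(logs)))
    return results

# Made-up toy numbers: three replicate membranes, four spots each.
toy_replicates = [
    [12.0, 150.0, 2900.0, 30000.0],
    [10.0, 210.0, 3300.0, 24000.0],
    [14.0, 180.0, 2500.0, 27000.0],
]
for spot, (m, s) in enumerate(per_spot_log_spread(toy_replicates)):
    print(f"spot {spot}: mean log10 = {m:.2f}, sd log10 = {s:.2f}")
```

This only summarizes the total spot-to-spot spread; it does not separate the contributions of membrane lot, RNA preparation, or operator, which would need the replicates to be labeled by those factors.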
I apologize if this is long-winded - but short of a real
explanation -