Re: [R] Finding non-normal distributions per row of data frame?

Dennis Murphy Fri, 04 Feb 2011 13:40:19 -0800

Hi:

The problem you have, IMO, is the multiplicity of tests and the p-value
adjustments that need to be made for them. Here's a little simulation using
normal and exponential distributions of the size you're contemplating.

# Normal data: 20,000 rows of 20 observations each
d <- matrix(rnorm(400000), nrow = 20000)
# Function to compute the Shapiro-Wilk test and return its statistic and
# p-value
f <- function(x) {
       w <- shapiro.test(x)
       c(teststat = unname(w$statistic), pval = w$p.value)
     }

# Apply f to the rows of d, return result as a data frame
u <- as.data.frame(t(apply(d, 1, f)))
# Use the p.adjust() function to correct p-values for multiplicity of
# testing
u$padj <- p.adjust(u$pval)
> length(u$pval[u$pval < 0.05])
[1] 1040      # in the ballpark of expectations on a comparisonwise basis
> length(u$pval[u$pval < 0.0001])
[1] 0            # ditto
> length(u$padj[u$padj < 0.05])
[1] 0            # how many adjusted p-values are below the familywise rate
# Repeat with standard exponential samples:
dd <- matrix(rexp(400000), nrow = 20000)
u2 <- as.data.frame(t(apply(dd, 1, f)))
u2$padj <- p.adjust(u2$pval)
> length(u2$pval[u2$pval < 0.05])
[1] 16670          # about 83.4% rejected on a comparisonwise basis
> length(u2$pval[u2$pval < 0.0001])
[1] 2493            # about 12.5% with comparisonwise p-values < 0.0001
> length(u2$padj[u2$padj < 0.05])
[1] 262              # about 1.3% significant at 5% familywise rate after
                     # adjustment for multiplicity

You can probably gain a little power with a different method of p-value
adjustment (p.adjust() uses Holm's method by default), but the central point
is that if you're expecting to find 'nuggets' out of 20,000 tests, those
samples will have to be extremely different from a normal distribution.
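
As a rough sketch of what a different adjustment method could look like (not
part of the original simulation): p.adjust() defaults to Holm's method, which
controls the familywise error rate, while method = "BH" controls the false
discovery rate instead. The smaller matrix, the seed, and the names sw_pval,
dd_small, p_raw, n_holm and n_bh are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Shapiro-Wilk p-value for one sample (hypothetical helper)
sw_pval <- function(x) shapiro.test(x)$p.value

# Smaller version of the exponential simulation: 2,000 rows of 20 values
dd_small <- matrix(rexp(2000 * 20), nrow = 2000)
p_raw <- apply(dd_small, 1, sw_pval)

# Holm (the p.adjust() default) controls the familywise error rate;
# Benjamini-Hochberg ("BH") controls the false discovery rate
n_holm <- sum(p.adjust(p_raw, method = "holm") < 0.05)
n_bh   <- sum(p.adjust(p_raw, method = "BH") < 0.05)

c(holm = n_holm, BH = n_bh)
```

BH-adjusted p-values are never larger than Holm-adjusted ones, so n_bh is at
least n_holm. The price is a weaker guarantee: a bound on the expected
proportion of false discoveries rather than on the chance of any false
discovery at all.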

To take up Greg Snow's comment about sampling p-values, observe that if the
normality assumption holds (i.e., the null hypothesis in the Shapiro-Wilk
test is true), then the p-values follow a Uniform(0, 1) distribution.

w1 <- runif(20000)
> length(w1[w1 < 0.05])
[1] 1047   # comparable to the simulation with normal samples above
> length(w1[w1 < 0.0001])
[1] 1         # ditto
w2 <- p.adjust(w1)    # adjust for multiplicity
> length(w2[w2 < 0.05])
[1] 0        # same as for the normal sample simulation
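
The uniformity claim can also be checked against actual Shapiro-Wilk p-values
rather than runif() draws. The following is only a sketch: the reduced number
of rows, the seed, and the names d_small and p_null are assumptions for
illustration.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# 2,000 normal samples of size 20, so the null hypothesis holds in every row
d_small <- matrix(rnorm(2000 * 20), nrow = 2000)
p_null <- apply(d_small, 1, function(x) shapiro.test(x)$p.value)

# Under a Uniform(0, 1) distribution, about 5% of p-values fall below 0.05
prop_below <- mean(p_null < 0.05)
prop_below

# The quartiles should sit near 0.25, 0.50 and 0.75
quantile(p_null, c(0.25, 0.50, 0.75))
```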

Like Greg Snow, I haven't seen any justification for why you think these
tests are useful or meaningful scientifically. If you're using something
like a t-test to make a decision of some sort, for example, then violating
the independence assumption has far more serious consequences for the
statistical properties of the t-test than violating the normality
assumption does.
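
A minimal sketch of the independence point, under assumptions that are not in
the original thread: one-sample t-tests of a true null hypothesis (mean zero)
are run on independent normal samples and on normal samples with AR(1)
dependence. The AR coefficient of 0.5, the sample sizes, the seed, and the
names t_pval, p_iid, p_ar1, rej_iid and rej_ar1 are all illustrative choices.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

n_rep <- 2000  # number of simulated samples per scenario
n_obs <- 20    # observations per sample

# One-sample t-test of H0: mean = 0, which is true in both scenarios
t_pval <- function(x) t.test(x)$p.value

# Scenario 1: independent normal observations
p_iid <- replicate(n_rep, t_pval(rnorm(n_obs)))

# Scenario 2: normal observations with AR(1) dependence (coefficient 0.5)
p_ar1 <- replicate(
  n_rep,
  t_pval(as.numeric(arima.sim(list(ar = 0.5), n = n_obs)))
)

# Rejection rates at the nominal 5% level
rej_iid <- mean(p_iid < 0.05)
rej_ar1 <- mean(p_ar1 < 0.05)
c(independent = rej_iid, ar1 = rej_ar1)
```

With positive dependence the t-test underestimates the standard error of the
mean, so the rejection rate in the second scenario should land well above the
nominal 5% even though every observation is marginally normal.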

This simulation, and your series of tests, assumes that the individual
samples are independent. I know little or nothing about microarrays, but is
it plausible to believe that samples from different locations on a
microarray are independent?

HTH,
Dennis

On Fri, Feb 4, 2011 at 11:38 AM, DB1984 <dannyb...@gmail.com> wrote:

>
> Thanks Peter.
>
> I understand your point, and that there is potentially a high false
> discovery rate - but I'd expect the interesting data points (genes on a
> microarray) to be within that list too. The next step would be to filter
> based on some greater understanding of the biology...
>
>
> Alternative approaches that come to mind are to look at the magnitude of
> the deviation - through Q-Q plot residuals, or to perform a linear
> regression on each row, and select those rows for which the coefficients
> fit predefined criteria. I'm still feeling my way into how to do this,
> though.
>
> Is there a better approach to identifying non-normal or skewed
> distributions that I am missing? Thanks for your advice...
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Finding-non-normal-distributions-per-row-of-data-frame-tp3259439p3260881.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
