On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> Thanks for the reference! I'll take a look at chapter 7, but let me first
> describe what I'm trying to achieve.
>
> I'm trying to identify interesting pairs (the anomalous co-occurrences)
> using the LLR. I'm doing this on a day's data and I want to keep the
> p-values.
> I then want to use the p-values to compute some overall probability over
> the course of multiple days to increase confidence in what I think are the
> interesting pairs.
>

You can't reliably combine p-values this way (repeated comparisons and all
that).
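
For concreteness, here is a minimal sketch of that selection effect (not
from this thread; it assumes Python with numpy and scipy, and the counts
are invented): scan thousands of pairs that are pure noise, keep only the
"best" p-values, and pool them with Fisher's method, and the pooled value
looks overwhelmingly significant even though nothing real is there.

    # Hypothetical sketch: why pooling p-values after scanning many pairs
    # misleads.  Under the null every p-value is Uniform(0, 1), but pooling
    # only the smallest ones out of thousands (here via Fisher's method)
    # yields a spuriously tiny combined p-value.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_pairs = 10_000                      # pairs scanned in one day
    null_p = rng.uniform(size=n_pairs)    # every pair is pure noise

    top = np.sort(null_p)[:50]            # keep the 50 "most interesting"
    stat, pooled_p = stats.combine_pvalues(top, method="fisher")

    print(f"smallest single p-value: {null_p.min():.2e}")
    print(f"Fisher-pooled p over the selected 50: {pooled_p:.2e}")
    # The pooled value is tiny purely because of the selection step.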

Also, in practice, if you take the top 50-100 indicators of this sort, the
p-values will be so astronomically small that frequentist tests of
significance are ludicrous.
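
To illustrate how small, here is a minimal sketch of the 2x2 G^2
(log-likelihood ratio) statistic for a single pair; the counts are
invented and it assumes numpy and scipy.  With realistic event counts the
nominal chi-squared p-value underflows to zero in double precision.

    # Hypothetical sketch (made-up counts) of the 2x2 LLR (G^2) test for
    # one co-occurring pair and its nominal chi-squared p-value.
    import numpy as np
    from scipy.stats import chi2

    def llr_2x2(k11, k12, k21, k22):
        """G^2 = 2 * sum k_ij * ln(k_ij * N / (row_i * col_j)), 0*ln(0) := 0."""
        k = np.array([[k11, k12], [k21, k22]], dtype=float)
        n = k.sum()
        expected = np.outer(k.sum(axis=1), k.sum(axis=0)) / n
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(k > 0, k * np.log(k / expected), 0.0)
        return 2.0 * terms.sum()

    # k11: A and B together, k12: A without B, k21: B without A, k22: neither
    g2 = llr_2x2(1_000, 5_000, 4_000, 990_000)
    p = chi2.sf(g2, df=1)   # nominal p-value, 1 degree of freedom
    print(f"G^2 = {g2:.1f}, nominal p = {p:.3g}")  # p underflows to 0 here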

That said, the assumptions underlying the tests are really a much bigger
problem.  The interesting problems of the world are often highly
non-stationary, which can lead to all kinds of problems in interpreting
these results.  What does it mean if something shows a 10^-20 p-value one
day and a 0.2 p-value the next?  Are you going to multiply them?  Or just
say that something isn't quite the same?  But how do you avoid comparing
p-values in that case, which is a famously bad practice?

To my mind, the real problem here is that we are simply asking the wrong
question.  We shouldn't be asking about individual features.  We should be
asking about overall model performance.  You *can* measure real-world
performance and you *can* put error bars around that performance and you
*can* see changes and degradation in that performance.  All of those
comparisons are well-founded and work great.  Whether the model has
selected too many or too few variables really is a diagnostic matter that
has little to do with answering the question of whether the model is
working well.
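
As a concrete (hypothetical) version of that kind of measurement, the
sketch below assumes a per-request hit/miss metric and puts percentile
bootstrap error bars around each day's overall performance; the data and
rates are invented, and it only assumes numpy.

    # Hypothetical sketch: measure an end-to-end metric per day and put
    # bootstrap error bars around it, then compare days on that footing
    # rather than via per-feature p-values.
    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_ci(per_example_scores, n_boot=2000, alpha=0.05):
        """Percentile bootstrap CI for the mean of a per-example metric."""
        scores = np.asarray(per_example_scores, dtype=float)
        idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
        means = scores[idx].mean(axis=1)
        lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
        return scores.mean(), lo, hi

    # e.g. 1 = recommendation clicked, 0 = not, for each request that day
    day1 = rng.binomial(1, 0.12, size=5_000)
    day2 = rng.binomial(1, 0.10, size=5_000)
    for name, day in [("day 1", day1), ("day 2", day2)]:
        mean, lo, hi = bootstrap_ci(day)
        print(f"{name}: hit rate {mean:.3f}  (95% CI {lo:.3f} to {hi:.3f})")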
