Could you be more explicit?
What models are these, and how do I use them to track how similar two items are?

I'm essentially working with a custom-tailored RowSimilarityJob, after
first filtering out users with too many items.


On Fri, Jun 21, 2013 at 12:35 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Well, you are still stuck with the problem that pulling more bits out of
> the small-count data is a bad idea.
>
> Most of the models that I am partial to never even honestly estimate
> probabilities.  They just include or exclude features and then weight rare
> features higher than common ones.
>
> This is easy to do across days and very easy to have different days
> contribute differently.
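>
> For instance, a weighting function along these lines (just a sketch, not
> tested code; the cutoff value and the log-inverse-frequency weighting are
> one plausible choice among many):
>
>   // Exclude a feature unless its LLR score clears a cutoff, then weight
>   // the survivors by rarity rather than by an estimated probability.
>   static final double CUTOFF = 10.0;  // hypothetical LLR threshold
>
>   static double weight(long featureCount, long totalCount, double llr) {
>     if (llr < CUTOFF) {
>       return 0.0;  // excluded: not anomalous enough to keep
>     }
>     // included: rare features get a higher weight than common ones
>     return Math.log((double) totalCount / featureCount);
>   }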
>
>
>
> On Fri, Jun 21, 2013 at 10:13 AM, Dan Filimon
> <dangeorge.fili...@gmail.com> wrote:
>
> > The thing is there's no real model for which these are features.
> > I'm looking for pairs of similar items (and eventually groups). I'd like
> > a probabilistic interpretation of how similar two items are. Something
> > like "what is the probability that a user that likes one will also like
> > the other?"
> >
> > Then, with these probabilities per day, I'd combine them over the course
> > of multiple days by "pulling" the older probabilities towards 0.5:
> > alpha * 0.5 + (1 - alpha) * p would be the linear approach to combining
> > this, where alpha is 0 for the most recent day and larger for older ones.
> > Then, I'd take the average of those estimates.
> > The result would, in my mind, be a "smoothed" probability.
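> >
> > In code, the combination would be roughly this (a sketch; the alpha
> > schedule itself is something I'd still have to pick):
> >
> >   // Pull each day's probability toward 0.5 by an age-dependent alpha,
> >   // then average the smoothed estimates.  alpha[0] = 0 for the most
> >   // recent day; older days get larger alphas.
> >   static double combine(double[] dailyP, double[] alpha) {
> >     double sum = 0.0;
> >     for (int day = 0; day < dailyP.length; day++) {
> >       sum += alpha[day] * 0.5 + (1 - alpha[day]) * dailyP[day];
> >     }
> >     return sum / dailyP.length;
> >   }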
> >
> > Then, I'd get the top k per item from these.
> >
> >
> >
> > On Fri, Jun 21, 2013 at 11:45 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon
> > > <dangeorge.fili...@gmail.com> wrote:
> > >
> > > > Thanks for the reference! I'll take a look at chapter 7, but let me
> > > > first describe what I'm trying to achieve.
> > > >
> > > > I'm trying to identify interesting pairs: the anomalous co-occurrences
> > > > flagged by the LLR. I'm doing this for a day's data and I want to keep
> > > > the p-values. I then want to use the p-values to compute some overall
> > > > probability over the course of multiple days to increase confidence in
> > > > what I think are the interesting pairs.
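> > > >
> > > > Concretely, per pair and per day, I'd fill in the 2x2 contingency
> > > > counts and score them, something like this (a sketch; I'm assuming
> > > > Mahout's LogLikelihood helper here):
> > > >
> > > >   import org.apache.mahout.math.stats.LogLikelihood;
> > > >
> > > >   // k11: users with both A and B; k12: A but not B;
> > > >   // k21: B but not A; k22: users with neither.
> > > >   static double pairScore(long k11, long k12, long k21, long k22) {
> > > >     // Under the null, this is asymptotically chi^2 with one degree
> > > >     // of freedom, so the day's p-value is the chi^2 tail beyond it.
> > > >     return LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
> > > >   }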
> > > >
> > >
> > > You can't reliably combine p-values this way (repeated comparisons and
> > > all that).
> > >
> > > Also, in practice if you take the top 50-100 indicators of this sort
> > > the p-values will be so astronomically small that frequentist tests of
> > > significance are ludicrous.
> > >
> > > That said, the assumptions underlying the tests are really a much
> > > bigger problem.  The interesting problems of the world are often highly
> > > non-stationary, which can lead to all kinds of problems in interpreting
> > > these results.  What does it mean if something shows a 10^-20 p-value
> > > one day and a 0.2 value the next?  Are you going to multiply them?  Or
> > > just say that something isn't quite the same?  But how do you avoid
> > > comparing p-values in this case, which is a famously bad practice?
> > >
> > > To my mind, the real problem here is that we are simply asking the
> > > wrong question.  We shouldn't be asking about individual features.  We
> > > should be asking about overall model performance.  You *can* measure
> > > real-world performance and you *can* put error bars around that
> > > performance and you *can* see changes and degradation in that
> > > performance.  All of those comparisons are well-founded and work great.
> > > Whether the model has selected too many or too few variables really is
> > > a diagnostic matter that has little to do with answering the question
> > > of whether the model is working well.
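> > >
> > > For example, error bars on a held-out metric could come from a plain
> > > bootstrap over users (a sketch; which metric you resample is up to
> > > you):
> > >
> > >   import java.util.Arrays;
> > >   import java.util.Random;
> > >
> > >   // Bootstrap a 95% confidence interval around the mean of a
> > >   // per-user performance metric measured on held-out data.
> > >   static double[] bootstrapCI(double[] metric, int reps, Random rnd) {
> > >     double[] means = new double[reps];
> > >     int n = metric.length;
> > >     for (int r = 0; r < reps; r++) {
> > >       double sum = 0.0;
> > >       for (int i = 0; i < n; i++) {
> > >         sum += metric[rnd.nextInt(n)];  // resample with replacement
> > >       }
> > >       means[r] = sum / n;
> > >     }
> > >     Arrays.sort(means);
> > >     return new double[] {means[(int) (0.025 * reps)],
> > >                          means[(int) (0.975 * reps)]};
> > >   }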
> > >
> >
>
