Could you be more explicit? What models are these, and how do I use them to track how similar two items are?
I'm essentially working with a custom-tailored RowSimilarityJob, after first filtering out users with too many items. (Sketches of the individual pieces discussed in this thread follow the quoted messages below.)

On Fri, Jun 21, 2013 at 12:35 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Well, you are still stuck with the problem that pulling more bits out of
> the small-count data is a bad idea.
>
> Most of the models that I am partial to never even honestly estimate
> probabilities. They just include or exclude features, and then weight rare
> features higher than common ones.
>
> This is easy to do across days, and very easy to have different days
> contribute differently.
>
> On Fri, Jun 21, 2013 at 10:13 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
>
> > The thing is, there's no real model for which these are features.
> > I'm looking for pairs of similar items (and eventually groups). I'd like
> > a probabilistic interpretation of how similar two items are, something
> > like "what is the probability that a user who likes one will also like
> > the other?".
> >
> > Then, with these probabilities per day, I'd combine them over the course
> > of multiple days by "pulling" the older probabilities towards 0.5:
> > alpha * 0.5 + (1 - alpha) * p would be the linear approach to combining
> > this, where alpha is 0 for the most recent day and larger for older
> > ones. Then, I'd take the average of those estimates. The result would,
> > in my mind, be a "smoothed" probability.
> >
> > Then, I'd get the top k per item from these.
> >
> > On Fri, Jun 21, 2013 at 11:45 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > > On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> > >
> > > > Thanks for the reference! I'll take a look at chapter 7, but let me
> > > > first describe what I'm trying to achieve.
> > > >
> > > > I'm trying to identify interesting pairs, the anomalous
> > > > co-occurrences, with the LLR. I'm doing this for a day's data, and I
> > > > want to keep the p-values. I then want to use the p-values to
> > > > compute some overall probability over the course of multiple days,
> > > > to increase confidence in what I think are the interesting pairs.
> > >
> > > You can't reliably combine p-values this way (repeated comparisons and
> > > all that).
> > >
> > > Also, in practice, if you take the top 50-100 indicators of this sort,
> > > the p-values will be so astronomically small that frequentist tests of
> > > significance are ludicrous.
> > >
> > > That said, the assumptions underlying the tests are really a much
> > > bigger problem. The interesting problems of the world are often highly
> > > non-stationary, which can lead to all kinds of problems in
> > > interpreting these results. What does it mean if something shows a
> > > 10^-20 p-value one day and a 0.2 value the next? Are you going to
> > > multiply them? Or just say that something isn't quite the same? But
> > > then how do you avoid comparing p-values, which is a famously bad
> > > practice?
> > >
> > > To my mind, the real problem here is that we are simply asking the
> > > wrong question. We shouldn't be asking about individual features. We
> > > should be asking about overall model performance. You *can* measure
> > > real-world performance, you *can* put error bars around that
> > > performance, and you *can* see changes and degradation in that
> > > performance. All of those comparisons are well-founded and work great.
> > > Whether the model has selected too many or too few variables really is
> > > a diagnostic matter that has little to do with answering the question
> > > of whether the model is working well.
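To make the pieces above concrete, here are a few sketches, starting with the pre-filtering step before RowSimilarityJob: dropping users with too many items so prolific users don't flood the co-occurrence counts. This version assumes an in-memory user-to-items map; the cap, the names, and the representation are all illustrative, since in practice this would be a MapReduce pass over the preference data.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-filter: drop users whose item count exceeds a cap before
// the user-item matrix goes into RowSimilarityJob. The cap value and the
// in-memory map are illustrative only.
public final class ProlificUserFilter {
  static void filterProlificUsers(Map<Long, List<Long>> itemsByUser, int maxItemsPerUser) {
    itemsByUser.values().removeIf(items -> items.size() > maxItemsPerUser);
  }

  public static void main(String[] args) {
    Map<Long, List<Long>> itemsByUser = new HashMap<>();
    itemsByUser.put(1L, new ArrayList<>(List.of(10L, 11L)));
    itemsByUser.put(2L, new ArrayList<>(List.of(10L, 11L, 12L, 13L, 14L)));
    filterProlificUsers(itemsByUser, 3); // user 2 is dropped
    System.out.println(itemsByUser.keySet()); // prints [1]
  }
}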
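Next, a self-contained sketch of the per-pair LLR score Dan is computing, in the entropy formulation (the same form that Mahout's LogLikelihood class uses, as far as I can tell). The counts come from one day's 2x2 co-occurrence table; the numbers in main are made up. For the p-values Dan wants to keep: under the null, the LLR statistic is asymptotically chi-squared with one degree of freedom, so a per-day p-value would be 1 - chi2cdf(llr, 1), e.g. via Commons Math's ChiSquaredDistribution.

// Per-pair LLR from a 2x2 co-occurrence table:
// k11 = users with both items, k12 = first item only,
// k21 = second item only, k22 = users with neither.
// LLR = 2 * (H(row sums) + H(col sums) - H(cells)), unnormalized entropies.
public final class Llr {
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy of a list of counts: xLogX(sum) - sum of xLogX(k).
  private static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long k : counts) {
      sum += k;
      sumXLogX += xLogX(k);
    }
    return xLogX(sum) - sumXLogX;
  }

  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Floating-point rounding can push the difference slightly negative.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }

  public static void main(String[] args) {
    // Example: 13 users saw both items, out of ~100k users (made-up counts).
    System.out.println(logLikelihoodRatio(13, 1000, 1000, 100000));
  }
}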
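Dan's day-combination rule, alpha * 0.5 + (1 - alpha) * p averaged across days, sketched directly. The linear ramp for alpha below is an assumption; the thread only fixes alpha = 0 for the most recent day and larger values for older days.

// Day-weighted smoothing: pull each older day's probability estimate toward
// 0.5 by a factor alpha that grows with the day's age, then average.
public final class SmoothedProbability {
  static double combine(double[] dailyP) {
    int n = dailyP.length; // dailyP[0] is the most recent day
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
      double alpha = (double) i / n; // 0 today, approaching 1 for the oldest day
      sum += alpha * 0.5 + (1.0 - alpha) * dailyP[i];
    }
    return sum / n;
  }

  public static void main(String[] args) {
    // Three days of per-pair estimates, newest first.
    System.out.println(combine(new double[] {0.9, 0.8, 0.6}));
  }
}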
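Finally, a sketch in the direction Ted points instead: skip per-feature significance and put error bars on overall model performance. A bootstrap confidence interval around a per-user hit rate (1 if the model's top-k list contained a held-out item, else 0) is one standard way to do that; the metric, names, and numbers here are illustrative, not something the thread specifies.

import java.util.Arrays;
import java.util.Random;

// Bootstrap confidence interval around a per-user hit rate: resample users
// with replacement, recompute the mean each time, and take the percentile
// interval of the resampled means.
public final class PerformanceErrorBars {
  static double[] bootstrapInterval(int[] hits, int resamples, double level, long seed) {
    Random rnd = new Random(seed);
    double[] means = new double[resamples];
    for (int r = 0; r < resamples; r++) {
      int sum = 0;
      for (int i = 0; i < hits.length; i++) {
        sum += hits[rnd.nextInt(hits.length)]; // resample users with replacement
      }
      means[r] = (double) sum / hits.length;
    }
    Arrays.sort(means);
    int lo = (int) Math.floor((1.0 - level) / 2.0 * resamples);
    int hi = (int) Math.ceil((1.0 + level) / 2.0 * resamples) - 1;
    return new double[] {means[lo], means[hi]};
  }

  public static void main(String[] args) {
    int[] hits = new int[1000];
    for (int i = 0; i < 300; i++) hits[i] = 1; // a made-up 30% hit rate
    double[] ci = bootstrapInterval(hits, 5000, 0.95, 42L);
    System.out.printf("95%% CI: [%.3f, %.3f]%n", ci[0], ci[1]);
  }
}

Day-over-day changes in an interval like this are directly comparable in a way that day-over-day p-values are not, which is the substance of Ted's objection above.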