Re: Log-likelihood based correlation test?

Noelia Osés Fernández Mon, 20 Nov 2017 04:22:24 -0800

This thread is very enlightening, thank you very much!

Is there a way I can see what the P, PtP, and PtL matrices of an app are?
In the handmade case, for example?


Are there any pio calls I can use to get these?

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com> wrote:

> Mahout builds the model by doing matrix multiplication (PtP) then
> calculating the LLR score for every non-zero value. We then keep the top K
> or use a threshold to decide whether to keep a value or not (both are
> supported in the UR). LLR is a metric for seeing how likely it is that 2
> events in a large group are correlated. Therefore LLR is only used to
> remove weak data from the model.
>
> So Mahout builds the model, then it is put into Elasticsearch, which is
> used as a KNN (K-Nearest Neighbors) engine. The LLR score is not put into
> the model, only an indicator that the item survived the LLR test.
>
> The KNN is applied using the user’s history as the query and finding the
> items that most closely match it. Since PtP will have items in rows and
> each row will have correlating items, this “search” method works quite
> well to find items that had very similar items purchased with them as are
> in the user’s history.
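
A minimal sketch of the "user history as query" idea described above, in Python. The `model` dict, the item names, and the `recommend` helper are all hypothetical, and plain overlap counting stands in for the Lucene relevance scoring that Elasticsearch actually applies. Skipping items already in the history is also an assumption of this sketch.

```python
from collections import Counter

# Hypothetical model: each item maps to the set of items that survived the
# LLR test for it (the indicators stored with that item in the index).
model = {
    "tablet":  {"phone", "case"},
    "charger": {"phone", "laptop"},
    "case":    {"phone", "tablet"},
    "mouse":   {"laptop"},
}

def recommend(history, model, k=3):
    """Rank items by how many of their indicators appear in the user's history."""
    history = set(history)
    scores = Counter()
    for item, indicators in model.items():
        if item in history:
            continue  # assumption: do not recommend items the user already has
        overlap = len(indicators & history)
        if overlap:
            scores[item] = overlap
    # Sort by score (descending), then by name so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

recs = recommend(["phone", "laptop"], model)
print(recs)
```

Here "charger" ranks first because both of its indicators appear in the history; a real index would weight the matches rather than just count them.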
>
> ================ that is the simple explanation ================
>
> Item-based recs take the model items (correlated items by the LLR test) as
> the query and the results are the most similar items—the items with most
> similar correlating items.
>
> The model is items in rows and items in columns if you are only using one
> event: PtP. If you think it through, it is all purchased items as the row
> keys, and each row holds the other items purchased along with the row
> key. LLR filters out the weakly correlating non-zero values (0 means no
> evidence of correlation anyway). If we didn’t do this it would be purely a
> “Cooccurrence” recommender, one of the first useful ones. But filtering
> based on cooccurrence strength (PtP values without LLR applied to them)
> produces much worse results than using LLR to filter for the most highly
> correlated cooccurrences. You get a similar effect with Matrix
> Factorization but you can only use one type of event for various reasons.
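
To make the "items in rows, items in columns" picture concrete, here is a small sketch that builds the raw PtP cooccurrence counts for a handmade data set. The users, items, and the `cooccurrence` helper are made up for illustration; this is not how Mahout computes it internally (Mahout does a distributed matrix multiplication), and leaving out the diagonal (an item with itself) is an assumption of the sketch.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical handmade purchase data: user -> set of purchased items.
purchases = {
    "u1": {"iphone", "ipad"},
    "u2": {"iphone", "ipad", "nexus"},
    "u3": {"nexus", "galaxy"},
    "u4": {"iphone", "galaxy"},
}

def cooccurrence(purchases):
    """Return PtP as a nested dict: ptp[a][b] = number of users who bought both a and b."""
    ptp = defaultdict(lambda: defaultdict(int))
    for items in purchases.values():
        # Every ordered pair of distinct items bought by the same user
        # contributes 1 to the corresponding cell of PtP.
        for a, b in permutations(sorted(items), 2):
            ptp[a][b] += 1
    return ptp

ptp = cooccurrence(purchases)
for item in sorted(ptp):
    print(item, dict(ptp[item]))
```

Each printed row is one row of PtP before any LLR filtering: the row key is an item and the entries are the items bought along with it, with raw counts.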
>
> Since LLR is a probabilistic metric that only looks at counts, it can be
> applied equally well to PtV (purchase, view), PtS (purchase, search terms),
> PtC (purchase, category-preferences). We did an experiment using Mean
> Average Precision for the UR using video “Likes” vs. “Likes” and
> “Dislikes” (so LtL vs. LtL and LtD) scraped from rottentomatoes.com
> reviews and got a 20% lift in the MAP@k score by including data for
> “Dislikes”.
> https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>
> So the benefit and use of LLR is to filter weak data from the model and
> allow us to see if dislikes, and other events, correlate with likes. Adding
> this type of data, which is usually thrown away, is one of the most
> powerful reasons to use the algorithm. BTW the algorithm is called
> Correlated Cross-Occurrence (CCO).
>
> The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN
> query is that it is fast, taking the user’s realtime events into the query,
> but also that it is trivial to add all sorts of business rules, like give
> me recs based on user events but only ones from a certain category, or
> give me recs but only ones tagged as “in-stock”. In fact the business
> rules can have inclusion rules, exclusion rules, and be mixed with ANDs
> and ORs.
>
> BTW there is a version ready for testing with PIO 0.12.0 and ES5 here:
> https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT
> Instructions are in the readme; notice it is in the 0.7.0-SNAPSHOT branch.
>
>
> On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atroem...@salesforce.com>
> wrote:
>
> I'll echo Dan here. He and I went through the raw Mahout libraries called
> by the Universal Recommender, and while Noelia's description is accurate
> for an intermediate step, the indexing via ElasticSearch generates some
> separate relevancy scores based on their Lucene indexing scheme. The raw
> LLR scores are used in building this process, but the final scores served
> up by the APIs should be post-processed, and cannot be used to reconstruct
> the raw LLRs (to my understanding).
>
> There are also some additional steps including down-sampling, which scrubs
> out very rare combinations (which otherwise would have very high LLRs for
> a single observation), which partially corrects for the statistical problem
> of multiple detection. But the underlying logic is per Ted Dunning's
> research and summarized by Noelia, and is a solid way to approach
> interaction effects for tens of thousands of items and including secondary
> indicators (like demographics, or implicit preferences).
>
>
> Andrew Troemner, Associate Principal Data Scientist | salesforce.com
>
> On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
>
>> Maybe someone can correct me if I am wrong but in the code I believe
>> Elasticsearch is used instead of "resulting LLR is what goes into the AB
>> element in matrix PtP or PtL."
>>
>> By default the strongest 50 LLR scores get set as searchable values in
>> Elasticsearch per item-event pair.
>>
>> You can configure the thresholds for significance using the configuration
>> parameters maxCorrelatorsPerItem or minLLR. This configuration is
>> important because at the default of 50 you may end up treating all
>> "indicator values" as significant. More info here:
>> http://actionml.com/docs/ur_config
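
A short sketch of the two filtering strategies mentioned here: keep the top K by LLR, or keep only scores above a threshold. The `keep_correlators` helper and the scores are hypothetical; only the parameter names echo the `maxCorrelatorsPerItem` and `minLLR` settings named in the thread. How the UR combines the two internally is not shown.

```python
def keep_correlators(llr_scores, max_correlators_per_item=50, min_llr=None):
    """Select the correlators of one item from a dict {other_item: llr_score}."""
    # Strongest correlations first.
    ranked = sorted(llr_scores.items(), key=lambda kv: kv[1], reverse=True)
    if min_llr is not None:
        # Threshold strategy: drop anything below the minimum LLR.
        ranked = [(item, score) for item, score in ranked if score >= min_llr]
    # Top-K strategy: keep at most max_correlators_per_item entries.
    return [item for item, _ in ranked[:max_correlators_per_item]]

# Made-up LLR scores for the correlators of a single item.
scores = {"ipad": 12.4, "nexus": 0.3, "galaxy": 0.1}

kept_default = keep_correlators(scores)                   # top 50, no threshold
kept_thresholded = keep_correlators(scores, min_llr=5.0)  # threshold applied
print(kept_default, kept_thresholded)
```

With fewer than 50 candidates, the top-K rule alone keeps every one of them, weak or not, which is the situation described above; the threshold keeps only "ipad" in this made-up example.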
>>
>>
>>
>> On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>
>>>
>>> Let's see if I've understood how LLR is used in UR. Let P be the matrix
>>> for the primary conversion indicator (say purchases) and Pt its transpose.
>>>
>>> Then, with a second matrix, which can be P again to make PtP or a matrix
>>> for a secondary indicator (say L for likes) to make PtL, we take a row from
>>> Pt (item A) and a column from the second matrix (either P or L, in this
>>> example) (item B) and we calculate the table that Ted Dunning explains on
>>> his webpage: the number of cooccurrences in which items A *AND* B have
>>> been purchased (or purchased AND liked), the number of times that item A
>>> *OR* B has been purchased (or purchased OR liked), and the number of
>>> times that *neither* item A nor B has been purchased (or purchased or
>>> liked). With these counts we calculate LLR following the formulas that Ted
>>> Dunning provides and the resulting LLR is what goes into the AB element in
>>> matrix PtP or PtL. Correct?
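
For reference, a minimal Python sketch of the LLR computation on a 2x2 contingency table, using the entropy-based form from Ted Dunning's write-up. Note that his table splits the "A or B" count into two cells, A without B and B without A. The names `k11`, `k12`, `k21`, `k22` and the helper functions are choices made for this sketch, not a copy of Mahout's code.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table.

    k11: users with both A and B      k12: users with A but not B
    k21: users with B but not A       k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

print(llr(10, 10, 10, 10))  # A and B independent: score near zero
print(llr(10, 0, 0, 10))    # A and B always together: large score
```

Because the score depends only on the four counts, the same function applies whether B comes from the primary event (PtP) or from a secondary one such as likes (PtL).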
>>>
>>> Thank you!
>>>
>>> On 16 November 2017 at 17:03, Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>
>>>> Wonderful! Thanks Daniel!
>>>>
>>>> Suneel, I'm still new to the Apache ecosystem and so I know that Mahout
>>>> is used but only vaguely... I still don't know the different parts well
>>>> enough to have a good understanding of what each of them does (Spark,
>>>> MLLib, PIO, Mahout,...)
>>>>
>>>> Thank you both!
>>>>
>>>> On 16 November 2017 at 16:59, Suneel Marthi <smar...@apache.org> wrote:
>>>>
>>>>> Indeed so. Ted Dunning is an Apache Mahout PMC member and committer and
>>>>> the whole idea of Search-based Recommenders stems from his work and
>>>>> insights. If you didn't know, the PIO UR uses Apache Mahout under the
>>>>> hood and hence you see the LLR.
>>>>>
>>>>> On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <dgabrieli@salesforce.com> wrote:
>>>>>
>>>>>> I am pretty sure the LLR stuff in UR is based off of this blog post
>>>>>> and associated paper:
>>>>>>
>>>>>> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>>>>>>
>>>>>> Accurate Methods for the Statistics of Surprise and Coincidence
>>>>>> by Ted Dunning
>>>>>>
>>>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 16, 2017 at 10:26 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've been trying to understand how the UR algorithm works and I
>>>>>>> think I have a general idea. But I would like to have a *mathematical
>>>>>>> description* of the step in which the LLR comes into play. In the
>>>>>>> CCO presentations I have found it says:
>>>>>>>
>>>>>>> (PtP) compares column to column using
>>>>>>> *log-likelihood based correlation test*
>>>>>>>
>>>>>>> However, I have searched for "log-likelihood based correlation test"
>>>>>>> in google but no joy. All I get are explanations of the likelihood-ratio
>>>>>>> test to compare two models.
>>>>>>>
>>>>>>> I would very much appreciate a math explanation of log-likelihood
>>>>>>> based correlation test. Any pointers to papers or any other literature
>>>>>>> that explains this specifically are much appreciated.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Noelia
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>