Re: Total number of events in predictionio are showing less than the actual events

2017-11-23 Thread Abhimanyu Nagrath
Hi Pat,

I don't think HBase TTL is the issue, because:

   1. I added the data one day ago.
   2. I have a similar server running with 1.5 million events (each having 6k
   features) whose data is 10 days old, and it is working fine.

Regards,
Abhimanyu

On Thu, Nov 23, 2017 at 10:58 PM, Pat Ferrel  wrote:

> My vague recollection is that HBase may mark things for removal but wait
> for certain operations before they are compacted. If this is the case I’m
> sure there is a way to get the correct count so this may be a question for
> the HBase list.
>
>
> On Nov 23, 2017, at 1:51 AM, Abhimanyu Nagrath 
> wrote:
>
> Done the same as you have mentioned, but the problem still persists.
>
>
>
>
> Regards,
> Abhimanyu
>
> On Thu, Nov 23, 2017 at 2:53 PM, Александр Лактионов wrote:
>
>> Hi Abhimanyu,
>>
>> Try setting a TTL for the rows in your HBase table. It can be set in the hbase shell:
>> alter 'pio_event:events_?', NAME => 'e', TTL => 
>> and then run the following in the shell:
>> major_compact 'pio_event:events_?'
>>
>> You can configure automatic major compaction: it will delete all rows that
>> are older than the TTL.
>>
>> On Nov 23, 2017, at 12:19, Abhimanyu Nagrath wrote:
>>
>> Hi,
>>
>> I am stuck at this point. How can I identify the problem?
>>
>>
>> Regards,
>> Abhimanyu
>>
>> On Mon, Nov 20, 2017 at 11:08 AM, Abhimanyu Nagrath <
>> abhimanyunagr...@gmail.com> wrote:
>>
>>> Hi, I am new to PredictionIO v0.12.0 (Elasticsearch 5.2.1, HBase 1.2.6,
>>> Spark 2.6.0); hardware: 244 GB RAM, 32 cores. I have uploaded about 1
>>> million events (each containing 30k features). While uploading I could see
>>> the HBase disk usage increasing, and after all the events were uploaded the
>>> HBase disk size was 567 GB. To verify, I ran the following commands:
>>>
>>>  - pio-shell --with-spark --conf spark.network.timeout=1000
>>> --driver-memory 30G --executor-memory 21G --num-executors 7
>>> --executor-cores 3 --conf spark.driver.maxResultSize=4g --conf
>>> spark.executor.heartbeatInterval=1000
>>>  - import org.apache.predictionio.data.store.PEventStore
>>>  - val eventsRDD = PEventStore.find(appName="test")(sc)
>>>  - val c = eventsRDD.count()
>>> It shows an event count of 18944.
>>>
>>> After that, from the script through which I uploaded the events, I
>>> randomly queried with their event IDs, and each event was returned.
>>>
>>> I don't know how to make sure that all the events I uploaded are present
>>> in the app. Any help is appreciated.
>>>
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>
>>
>>
>
>


Re: Log-likelihood based correlation test?

2017-11-23 Thread Pat Ferrel
Use the default. Tuning with a threshold is only for atypical data, and unless 
you have a harness for cross-validation you would not know whether you were 
making things worse or better. We have our own tools for this but have never 
needed threshold tuning.

Yes, llrDownsampled(PtP) is the “model”, each doc put into Elasticsearch is a 
sparse representation of a row from it, along with those from PtV, PtC,… Each 
gets a “field” in the doc.


On Nov 22, 2017, at 6:16 AM, Noelia Osés Fernández  wrote:

Thanks Pat!

How can I tune the threshold?

And when you say "compare to each item in the model", do you mean each row in 
PtP?

On 21 November 2017 at 19:56, Pat Ferrel wrote:
No, PtP non-zero elements have LLR calculated. The highest scores in each row 
are kept, or those above some threshold; the rest are removed as “noise”. These 
are put into the Elasticsearch model without scores.

Elasticsearch compares the similarity of the user history to each item in the 
model to find the KNN similar ones. This uses OKAPI BM25 from Lucene, which has 
several benefits over pure cosines (it actually consists of adjustments to 
cosine) and we also use norms. With ES 5 we should see quality improvements due 
to this. 
https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html
 




On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP 
are removed by the LLR test (set to zero, to be precise). But the elements that 
survive are calculated by matrix multiplication. The final PtP is put into 
Elasticsearch, and when we query for user recommendations ES uses KNN to find 
the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix 
multiplication, and I'm assuming that the P matrix only has 0s and 1s to 
indicate which items have been purchased by which user, then the elements of 
PtP are either 0 or greater than or equal to 1. However, the scores I get are 
below 1.

So is the KNN using cosine similarity as a metric to calculate the closest 
neighbours? And is the result of this cosine similarity metric what is 
returned as the 'score'?

If it is, when the score is greater than 1, is this because the different 
cosine similarities are added together, i.e. PtP, PtL, ...?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel wrote:
Mahout builds the model by doing matrix multiplication (PtP), then calculating 
the LLR score for every non-zero value. We then keep the top K, or use a 
threshold to decide whether to keep a value or not (both are supported in the UR). 
LLR is a metric for how likely it is that 2 events in a large group are correlated. 
Therefore LLR is only used to remove weak data from the model.
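For readers unfamiliar with the metric, here is a minimal sketch of an LLR (G-test) score for one PtP cell, following the rowEntropy/columnEntropy/matrixEntropy formulation used in Mahout. The counts come from a 2x2 co-occurrence table; all numbers below are made up for illustration, not from any real data set:

```python
import math

def x_log_x(x: float) -> float:
    # x * ln(x), with the conventional 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts: float) -> float:
    # "unnormalized entropy" of a set of counts, Mahout-style
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio for a 2x2 contingency table.
    k11: users with both items; k12/k21: one item only; k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # clamp tiny negative rounding error to zero
    return max(0.0, 2.0 * (row + col - mat))

# Strong co-occurrence scores high; independent counts score ~0,
# which is why LLR can separate real correlation from noise.
print(llr(30, 10, 10, 950))   # strongly correlated: large LLR
print(llr(10, 90, 90, 810))   # exactly independent: ~0
```

Items whose cell survives the top-K or threshold cut are what end up (without their scores) in the Elasticsearch model.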

So Mahout builds the model, then it is put into Elasticsearch, which is used as a 
KNN (k-nearest neighbors) engine. The LLR score is not put into the model, only 
an indicator that the item survived the LLR test.

The KNN is applied using the user's history as the query, finding the items that 
most closely match it. Since PtP has items in rows and each row holds that item's 
correlating items, this “search” method works quite well at finding items whose 
correlated purchases are very similar to the user's history.
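As a toy stand-in for the KNN step (real scoring in Elasticsearch uses BM25 over indexed fields, not raw overlap), each model "doc" lists the items that survived the LLR test for one row of PtP, and the query is the user's history. Item names and counts are hypothetical:

```python
# Each key is an item; each value is the set of items that survived the
# LLR test in that item's row of PtP (the sparse "doc" in the model).
model = {
    "itemA": {"itemB", "itemC"},
    "itemB": {"itemA"},
    "itemC": {"itemA", "itemD"},
}
user_history = {"itemB", "itemC"}  # the query: items the user purchased

def score(item: str) -> int:
    # crude match count; Elasticsearch would compute a BM25 score instead
    return len(model[item] & user_history)

ranked = sorted(model, key=score, reverse=True)
print(ranked[0])  # itemA - its correlated items best match the history
```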

=== that is the simple explanation 


Item-based recs take the model items (items correlated by the LLR test) as the 
query, and the results are the most similar items: the items with the most 
similar correlating items.

The model has items in rows and items in columns if you are only using one 
event: PtP. If you think it through, it has every purchased item as the row 
key and, in the row, the other items purchased along with it. LLR filters out the 
weakly correlating non-zero values (0 means no evidence of correlation anyway). 
If we didn't do this it would be a pure “cooccurrence” recommender, one of 
the first useful ones. But filtering on cooccurrence strength (PtP values 
without LLR applied to them) produces much worse results than using LLR to 
keep only the most highly correlated cooccurrences. You get a similar effect with 
matrix factorization, but there you can only use one type of event, for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be 
applied equally well to PtV (purchase, view), PtS (purchase, search terms), and PtC 
(purchase, category-preferences). We ran an experiment measuring Mean Average 
Precision for the UR using video “Likes” alone vs. “Likes” and “Dislikes” together, 
so LtL vs. LtL and LtD, scraped from rottentomatoes.com.

Re: Which template for predicting ratings?

2017-11-23 Thread Noelia Osés Fernández
May I please get an answer to this question? I have a project that depends
on it.

Using the Recommendation template
(https://github.com/apache/incubator-predictionio-template-recommender) and
the ecom recs template
(https://github.com/apache/incubator-predictionio-template-ecom-recommender),
*why are the predictions output by the algorithm outside the range of the
input data?*

*Are the predictions of this algorithm bounded?* If so, how can I know what
the bounds are?

If not, how can I make the predictions fall in the same range as the input
data?
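For context, an MLlib ALS-style recommender scores a (user, item) pair as the dot product of learned latent-factor vectors, and nothing in that computation clamps the result to the training-rating range, which is one way predictions can land outside it. A minimal sketch; the factor values below are hypothetical, not learned from any data:

```python
def predict(user_factors: list[float], item_factors: list[float]) -> float:
    # ALS-style score: dot product of latent-factor vectors
    return sum(u * i for u, i in zip(user_factors, item_factors))

user = [1.8, 2.1]  # hypothetical user latent factors
item = [2.0, 1.5]  # hypothetical item latent factors
print(predict(user, item))  # 6.75 - above a 5-star maximum
```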

Thank you very much!

On 14 November 2017 at 16:45, Noelia Osés Fernández 
wrote:

> Thanks Pat.
>
> I am now using the Recommendation template
> (http://predictionio.incubator.apache.org/templates/recommendation/quickstart/,
> https://github.com/apache/incubator-predictionio-template-recommender). I
> believe this template uses MLlib ALS.
>
> I am using the movielens ratings data. In the sample that I'm using, the
> minimum rating is 0.5 and the max is 5.
>
> However, the predictions returned by the recommendation engine are above
> 5. For example:
>
> Recommendations for user: 1
>
> {"itemScores":[{"item":"2492","score":8.760136688429496},
> {"item":"103228","score":8.074123814810278},
> {"item":"2907","score":7.659090305689766},
> {"item":"6755","score":7.65084600130184}]}
>
> Shouldn't these predictions be in the range from 0.5 to 5 ?
>
>
>
> On 13 November 2017 at 18:53, Pat Ferrel  wrote:
>
>> What I was saying is the UR can use ratings, but not predict them. Use
>> MLlib ALS recommenders if you want to predict them for all items.
>>
>>
>> On Nov 13, 2017, at 9:32 AM, Pat Ferrel  wrote:
>>
>> What we did in the article I attached was to assume 1-2 is dislike and 4-5
>> is like.
>>
>> These are treated as indicators and will produce a score from the
>> recommender but these do not relate to 1-5 scores.
>>
>> If you need to predict what the user would score an item MLlib ALS
>> templates will do it.
>>
>>
>>
>> On Nov 13, 2017, at 2:42 AM, Noelia Osés Fernández 
>> wrote:
>>
>> Hi Pat,
>>
>> I truly appreciate your advice.
>>
>> However, what do you do with a client who is adamant that they want to
>> display the predicted ratings in the form of 1 to 5 stars? That's my case
>> right now.
>>
>> I will pose a more concrete question. *Is there any template for which
>> the scores predicted by the algorithm are in the same range as the ratings
>> in the training set?*
>>
>> Thank you very much for your help!
>> Noelia
>>
>> On 10 November 2017 at 17:57, Pat Ferrel  wrote:
>>
>>> Any of the Spark MLlib ALS recommenders in the PIO template gallery
>>> support ratings.
>>>
>>> However, I must warn that ratings are not very good for recommendations,
>>> and none of the big players use ratings anymore; Netflix doesn’t even
>>> display them. The reason is that your 2 may be my 3 or 4, and people
>>> rate different categories differently. For instance, Netflix found comedies
>>> were rated lower than independent films. Many solutions have been
>>> proposed and tried, but none have proven very helpful.
>>>
>>> There is another, more fundamental problem: why would you want to
>>> recommend the highest-rated item? What do you buy on Amazon or watch on
>>> Netflix? Are they only your highest-rated items? Research has shown that
>>> they are not. There was a whole misguided movement around ratings that
>>> affected academic papers and cross-validation metrics, and it has fairly well
>>> been discredited. It all came from the Netflix Prize, which used both.
>>> Netflix has since led the way in dropping ratings, as they saw the things I
>>> have mentioned.
>>>
>>> What do you do? Categorical indicators (like, dislike) or implicit
>>> indicators (buy) that are unambiguous work best. If a person buys something,
>>> they like it; if they rate it 3, do they like it? I buy many 3-rated items on
>>> Amazon if I need them.
>>>
>>> My advice is to drop ratings and use thumbs up or down. These are
>>> unambiguous, and the thumbs down can in some cases be used to predict thumbs
>>> up: https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>>> This uses data from a public web site to show significant lift from using
>>> “like” and “dislike” in recommendations. It used the Universal Recommender.
>>>
>>>
>>> On Nov 10, 2017, at 5:02 AM, Noelia Osés Fernández 
>>> wrote:
>>>
>>>
>>> Hi all,
>>>
>>> I'm new to PredictionIO so I apologise if this question is silly.
>>>
>>> I have an application in which users rate different items on a
>>> scale of 1 to 5 stars. I want to recommend items to a new user and give her
>>> the predicted rating as a number of stars. Which template should I use to do
>>> this? Note that I need the predicted rating to be in the same range of 1 to
>>> 5 stars.
>>>
>>> Is it possible to do 

Re: Total number of events in predictionio are showing less than the actual events

2017-11-23 Thread Abhimanyu Nagrath
But when I run the command "count 'pio_event:events'" in the HBase shell, it
shows me all the rows: 1.5 million.

On Thu, Nov 23, 2017 at 2:53 PM, Александр Лактионов 
wrote:

> Hi Abhimanyu,
>
> Try setting a TTL for the rows in your HBase table. It can be set in the hbase shell:
> alter 'pio_event:events_?', NAME => 'e', TTL => 
> and then run the following in the shell:
> major_compact 'pio_event:events_?'
>
> You can configure automatic major compaction: it will delete all rows that are
> older than the TTL.
>
> On Nov 23, 2017, at 12:19, Abhimanyu Nagrath wrote:
>
> Hi,
>
> I am stuck at this point. How can I identify the problem?
>
>
> Regards,
> Abhimanyu
>
> On Mon, Nov 20, 2017 at 11:08 AM, Abhimanyu Nagrath <
> abhimanyunagr...@gmail.com> wrote:
>
>> Hi, I am new to PredictionIO v0.12.0 (Elasticsearch 5.2.1, HBase 1.2.6,
>> Spark 2.6.0); hardware: 244 GB RAM, 32 cores. I have uploaded about 1
>> million events (each containing 30k features). While uploading I could see
>> the HBase disk usage increasing, and after all the events were uploaded the
>> HBase disk size was 567 GB. To verify, I ran the following commands:
>>
>>  - pio-shell --with-spark --conf spark.network.timeout=1000
>> --driver-memory 30G --executor-memory 21G --num-executors 7
>> --executor-cores 3 --conf spark.driver.maxResultSize=4g --conf
>> spark.executor.heartbeatInterval=1000
>>  - import org.apache.predictionio.data.store.PEventStore
>>  - val eventsRDD = PEventStore.find(appName="test")(sc)
>>  - val c = eventsRDD.count()
>> It shows an event count of 18944.
>>
>> After that, from the script through which I uploaded the events, I
>> randomly queried with their event IDs, and each event was returned.
>>
>> I don't know how to make sure that all the events I uploaded are present
>> in the app. Any help is appreciated.
>>
>>
>> Regards,
>> Abhimanyu
>>
>
>
>


Re: Total number of events in predictionio are showing less than the actual events

2017-11-23 Thread Александр Лактионов
Hi Abhimanyu,

Try setting a TTL for the rows in your HBase table. It can be set in the hbase shell:
alter 'pio_event:events_?', NAME => 'e', TTL => 
and then run the following in the shell:
major_compact 'pio_event:events_?'

You can configure automatic major compaction: it will delete all rows that are
older than the TTL.

> On Nov 23, 2017, at 12:19, Abhimanyu Nagrath wrote:
> 
> Hi,
> 
> I am stuck at this point. How can I identify the problem?
> 
> 
> Regards,
> Abhimanyu
> 
> On Mon, Nov 20, 2017 at 11:08 AM, Abhimanyu Nagrath wrote:
> Hi, I am new to PredictionIO v0.12.0 (Elasticsearch 5.2.1, HBase 1.2.6,
> Spark 2.6.0); hardware: 244 GB RAM, 32 cores. I have uploaded about 1
> million events (each containing 30k features). While uploading I could see
> the HBase disk usage increasing, and after all the events were uploaded the
> HBase disk size was 567 GB. To verify, I ran the following commands:
> 
>  - pio-shell --with-spark --conf spark.network.timeout=1000 
> --driver-memory 30G --executor-memory 21G --num-executors 7 --executor-cores 
> 3 --conf spark.driver.maxResultSize=4g --conf 
> spark.executor.heartbeatInterval=1000
>  - import org.apache.predictionio.data.store.PEventStore
>  - val eventsRDD = PEventStore.find(appName="test")(sc)
>  - val c = eventsRDD.count() 
> It shows an event count of 18944.
> 
> After that, from the script through which I uploaded the events, I randomly
> queried with their event IDs, and each event was returned.
> 
> I don't know how to make sure that all the events I uploaded are present in
> the app. Any help is appreciated.
> 
> 
> Regards,
> Abhimanyu
> 



Re: Total number of events in predictionio are showing less than the actual events

2017-11-23 Thread Abhimanyu Nagrath
Hi,

I am stuck at this point. How can I identify the problem?


Regards,
Abhimanyu

On Mon, Nov 20, 2017 at 11:08 AM, Abhimanyu Nagrath <
abhimanyunagr...@gmail.com> wrote:

> Hi, I am new to PredictionIO v0.12.0 (Elasticsearch 5.2.1, HBase 1.2.6,
> Spark 2.6.0); hardware: 244 GB RAM, 32 cores. I have uploaded about 1
> million events (each containing 30k features). While uploading I could see
> the HBase disk usage increasing, and after all the events were uploaded the
> HBase disk size was 567 GB. To verify, I ran the following commands:
>
>  - pio-shell --with-spark --conf spark.network.timeout=1000
> --driver-memory 30G --executor-memory 21G --num-executors 7
> --executor-cores 3 --conf spark.driver.maxResultSize=4g --conf
> spark.executor.heartbeatInterval=1000
>  - import org.apache.predictionio.data.store.PEventStore
>  - val eventsRDD = PEventStore.find(appName="test")(sc)
>  - val c = eventsRDD.count()
> It shows an event count of 18944.
>
> After that, from the script through which I uploaded the events, I randomly
> queried with their event IDs, and each event was returned.
>
> I don't know how to make sure that all the events I uploaded are present
> in the app. Any help is appreciated.
>
>
> Regards,
> Abhimanyu
>