RE: Problem with training in yarn cluster

2018-05-22 Thread Pat Ferrel
Actually, you might search the archives for “yarn” because I don’t recall
offhand how the setup works.

Archives here:
https://lists.apache.org/list.html?user@predictionio.apache.org

Also check the Spark-on-YARN requirements, and remember that
`pio train … -- <various Spark params>` lets you pass arbitrary Spark params
on the pio command line exactly as you would to spark-submit. The double
dash separates PIO params from Spark params.
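For example, a hedged sketch (the master, deploy mode, and memory values are illustrative, not recommendations for any particular cluster):

```shell
# Everything after the bare "--" is handed to spark-submit unchanged;
# everything before it is parsed by pio itself.
pio train -- --master yarn --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 4
```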


From: Pat Ferrel
Reply: user@predictionio.apache.org
Date: May 22, 2018 at 4:07:38 PM
To: user@predictionio.apache.org, Wojciech Kowalski
Subject: RE: Problem with training in yarn cluster

What is the command line for `pio train …`? Specifically, are you using
yarn-cluster mode? That causes the driver code, which is a PIO process, to
be executed on an executor; special setup is required for this.


From: Wojciech Kowalski
Reply: user@predictionio.apache.org
Date: May 22, 2018 at 2:28:43 PM
To: user@predictionio.apache.org
Subject: RE: Problem with training in yarn cluster

Hello,



Actually I have another error in the logs that is preventing training as
well; the full log appears in the quoted message below.
RE: Problem with training in yarn cluster

2018-05-22 Thread Pat Ferrel
What is the command line for `pio train …`? Specifically, are you using
yarn-cluster mode? That causes the driver code, which is a PIO process, to
be executed on an executor; special setup is required for this.


From: Wojciech Kowalski 
Reply: user@predictionio.apache.org 
Date: May 22, 2018 at 2:28:43 PM
To: user@predictionio.apache.org 
Subject:  RE: Problem with training in yarn cluster  

Hello,

 

Actually I have another error in the logs that is preventing training as well:

 

[INFO] [RecommendationEngine$]

               _   _             __  __ _
     /\       | | (_)           |  \/  | |
    /  \   ___| |_ _  ___  _ __ | \  / | |
   / /\ \ / __| __| |/ _ \| '_ \| |\/| | |
  / ____ \ (__| |_| | (_) | | | | |  | | |____
 /_/    \_\___|\__|_|\___/|_| |_|_|  |_|______|

[INFO] [Engine] Extracting datasource params...
[INFO] [WorkflowUtils$] No 'name' is found. Default empty String will be used.
[INFO] [Engine] Datasource params: (,DataSourceParams(shop_live,List(purchase, 
basket-add, wishlist-add, view),None,None))
[INFO] [Engine] Extracting preparator params...
[INFO] [Engine] Preparator params: (,Empty)
[INFO] [Engine] Extracting serving params...
[INFO] [Engine] Serving params: (,Empty)
[INFO] [log] Logging initialized @6774ms
[INFO] [Server] jetty-9.2.z-SNAPSHOT
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@1798eb08{/jobs,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@47c4c3cd{/jobs/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@3e080dea{/jobs/job,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@c75847b{/jobs/job/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@5ce5ee56{/stages,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@3dde94ac{/stages/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@4347b9a0{/stages/stage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@63b1bbef{/stages/stage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@10556e91{/stages/pool,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@5967f3c3{/stages/pool/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2793dbf6{/storage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@49936228{/storage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@7289bc6d{/storage/rdd,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@1496b014{/storage/rdd/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2de3951b{/environment,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@7f3330ad{/environment/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@40e681f2{/executors,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@61519fea{/executors/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@502b9596{/executors/threadDump,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@367b7166{/executors/threadDump/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@42669f4a{/static,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2f25f623{/,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@23ae4174{/api,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@4e33e426{/jobs/job/kill,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@38d9ae65{/stages/stage/kill,null,AVAILABLE,@Spark}
[INFO] [ServerConnector] Started Spark@17239b3{HTTP/1.1}{0.0.0.0:47948}
[INFO] [Server] Started @7040ms
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@16cffbe4{/metrics/json,null,AVAILABLE,@Spark}
[WARN] [YarnSchedulerBackend$YarnSchedulerEndpoint] Attempted to request 
executors before the AM has registered!
[ERROR] [ApplicationMaster] Uncaught exception:  
 

Thanks,

Wojciech

 

From: Wojciech Kowalski
Sent: 22 May 2018 23:20
To: user@predictionio.apache.org
Subject: Problem with training in yarn cluster

 

Hello, I am trying to set up a distributed cluster with all services
separated, but I have a problem while running train:

 

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /pio/pio.log (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.
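The `setFile(null,true)` failure above means log4j inside the YARN container tried to append to `/pio/pio.log`, a path that only exists on the machine where PIO is installed. One hedged workaround, sketched below with illustrative paths and under the assumption that a relocatable log4j config is acceptable, is to ship a log4j.properties that logs somewhere the container can write (its working directory or the console):

```shell
# Ship a log4j.properties with the job and point the driver and executors
# at it. The local file path here is an illustrative assumption.
pio train -- --master yarn --deploy-mode cluster \
  --files /path/to/log4j.properties \
  --driver-java-options "-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties"
```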

Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-11 Thread Pat Ferrel
BTW the Universal Recommender has its own community support group here:
https://groups.google.com/forum/#!forum/actionml-user


Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-11 Thread Pat Ferrel
Yes but do you really care as a business about “users who viewed this also
viewed that”? I’d say no. You want to help them find what to buy and there
is a big difference between viewing and buying behavior. If you are only
interested in increasing time on site, or have ads shown that benefit from
more views then it might make more sense but a pure e-comm site would be
after sales.

The algorithm inside the UR can do all of these, but only 1 and 2 are
possible with the current implementation. The algorithm is called Correlated
Cross-Occurrence, and it can be targeted to recommend any recorded behavior.
On the theory that you would never want to throw away correlated behavior
in building models, all behavior is taken into account, so #1 could be
restated more precisely (but somewhat redundantly) as “people who viewed
(but then bought) this also viewed (and bought) these”. This targets what
you show people toward “important” views. In fact, if you are also using
search behavior and brand preferences, it gets more wordy: “people who
viewed this (and bought, searched for, and preferred brands in a similar
way) also viewed”. So you are showing viewed items that match the type of
user doing the viewing. You could use just one type of behavior, but why?
Using all of them makes the views more targeted.

So though it is possible to do 1–3 exactly as stated, you will get better
sales with the approach described above.

Using my suggested method above, #1 and #3 are the same:

   1. "eventNames": ["view", "buy", "search", "brand-pref"]
   2. "eventNames": ["buy", "view", "search", "brand-pref"]
   3. "eventNames": ["view", "buy", "search", "brand-pref"]
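As one concrete sketch, case 2 above maps to a single `eventNames` line in the algorithms section of engine.json; the snippet below just prints that illustrative fragment (the surrounding engine.json fields are omitted):

```shell
# The first event name is the primary one, i.e. the event type that gets
# recommended; the rest contribute correlated cross-occurrence data.
cat <<'EOF'
"eventNames": ["buy", "view", "search", "brand-pref"]
EOF
```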

If you want to do exactly as you have shown, you’d have to throw out all
correlated cross-behavior:

   1. "eventNames": ["view"]
   2. "eventNames": ["buy"]
   3. "eventNames": ["buy", "view"], but then the internal model query would
      be only the current user’s view history. This is not supported in this
      exact form but could be added.

As you can see, you are discarding a lot of valuable data if you insist on a
very pure interpretation of your 1–3 definitions, and I can promise you
that most knowledgeable e-com sites do not mince words too finely.


From: Nasos Papageorgiou
Reply: user@predictionio.apache.org
Date: May 11, 2018 at 12:39:27 AM
To: user@predictionio.apache.org
Subject: Re: UR: build/train/deploy once & querying for 3 use cases

Just a correction: the file in the first bullet is engine.json (not
events.json).

2018-05-10 17:01 GMT+03:00 Nasos Papageorgiou :

>
>
> Hi all,
> to elaborate on these cases, the purpose is to create a UR for the cases
> of:
>
> 1.   “User who Viewed this item also Viewed”
>
> 2.   “User who Bought this item also Bought”
>
> 3.   “User who Viewed this item also Bought ”
>
> while having Events of Buying and Viewing a product.
> I would like to make some questions:
>
> 1.   On Data Source Parameters (file: events.json): the sequence in which
> the events are defined does not matter, right?
>
> 2.   If I specify one event type in “eventNames” in the Algorithm
> section (i.e. “view”) and no event in “blacklistEvents”, will the
> second event type (i.e. “buy”) appear in the recommended list?
>
> 3.   If I use only "user" in the query, the "item" case will not be
> used for the recommendations. What happens with new users in that
> case? Shall I use both "user" and "item" instead?
>
> 4.Values of less than 1 in “UserBias” and “ItemBias” on the query
> do not have any effect on the result.
>
> 5.Is it feasible to build/train/deploy only once, and query for
> all 3 use cases?
>
>
> 6.   How do we make queries against the different apps? There is no
> obvious way to do so in the query parameters or the URL.
>
> Thank you.
>
>
>
> *From:* Pat Ferrel [mailto:p...@occamsmachete.com]
> *Sent:* Wednesday, May 09, 2018 4:41 PM
> *To:* user@predictionio.apache.org; gerasimos xydas
> *Subject:* Re: UR: build/train/deploy once & querying for 3 use cases
>
>
>
> Why do you want to throw away user behavior in making recommendations? The
> lift you get in purchases will be less.
>
>
>
> There is a use case for this when you are making recommendations basically
> inside a session where the user is browsing/viewing things on a hunt for
> something. In this case you would want to make recs using the user history
> of views but you have to build a model of purchase as the primary indicator
> or you won’t get purchase recommendations and believe me recommending views
> is a road to bad results. People view many things they do not buy, putting
> only view behavior that lead to purchases in the model. So create a model
> with purchase as the primary indi

Re: UR evaluation

2018-05-10 Thread Pat Ferrel
Exactly: ranking is the only task of a recommender. Precision is not
automatically good at that, but something like MAP@k is.


From: Marco Goldin
Date: May 10, 2018 at 10:09:22 PM
To: Pat Ferrel
Cc: user@predictionio.apache.org
Subject: Re: UR evaluation

Very nice article. It makes the importance of treating recommendation as a
ranking task much clearer.
Thanks

On Thu, 10 May 2018 at 19:12, Pat Ferrel wrote:

> Here is a discussion of how we used it for tuning with multiple input
> types:
> https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>
> We used video likes, dislikes, and video metadata to increase our MAP@k
> by 26% eventually. So this was mainly an exercise in incorporating data.
> Since this research was done we have learned how to better tune this type
> of situation but that’s a long story fit for another blog post.
>
>
> From: Marco Goldin  
> Reply: user@predictionio.apache.org 
> 
> Date: May 10, 2018 at 9:54:23 AM
> To: Pat Ferrel  
> Cc: user@predictionio.apache.org 
> 
> Subject:  Re: UR evaluation
>
> thank you very much, i didn't see this tool, i'll definitely try it.
> Clearly better to have such a specific instrument.
>
>
>
> 2018-05-10 18:36 GMT+02:00 Pat Ferrel :
>
>> You can if you want but we have external tools for the UR that are much
>> more flexible. The UR has tuning that can’t really be covered by the built
>> in API. https://github.com/actionml/ur-analysis-tools They do MAP@k as
>> well as creating a bunch of other metrics and comparing different types of
>> input data. They use a running UR to make queries against.
>>
>>
>> From: Marco Goldin  
>> Reply: user@predictionio.apache.org 
>> 
>> Date: May 10, 2018 at 7:52:39 AM
>> To: user@predictionio.apache.org 
>> 
>> Subject:  UR evaluation
>>
>> hi all, i successfully trained a universal recommender but i don't know
>> how to evaluate the model.
>>
>> Is there a recommended way to do that?
>> I saw that *predictionio-template-recommender* actually has
>> the Evaluation.scala file which uses the class *PrecisionAtK *for the
>> metrics.
>> Should i use this template to implement a similar evaluation for the UR?
>>
>> thanks,
>> Marco Goldin
>> Horizons Unlimited s.r.l.
>>
>>
>


Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-09 Thread Pat Ferrel
Why do you want to throw away user behavior in making recommendations? The
lift you get in purchases will be less.

There is a use case for this when you are making recommendations basically
inside a session where the user is browsing/viewing things on a hunt for
something. In this case you would want to make recs using the user history
of views but you have to build a model of purchase as the primary indicator
or you won’t get purchase recommendations, and believe me, recommending views
is a road to bad results. People view many things they do not buy; the model
keeps only the view behavior that leads to purchases. So create a model
with purchase as the primary indicator and view as the secondary.
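Sketched as an engine.json fragment (event names match this thread; the snippet just prints the illustrative line, with the surrounding engine.json fields omitted):

```shell
# "purchase" first = primary indicator, the event type that gets recommended;
# "view" second = secondary indicator feeding the correlation model.
cat <<'EOF'
"eventNames": ["purchase", "view"]
EOF
```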

Once you have the model, use only the user’s session viewing history as the
Elasticsearch query.

This is a feature on our list.


From: gerasimos xydas
Reply: user@predictionio.apache.org
Date: May 9, 2018 at 6:20:46 AM
To: user@predictionio.apache.org
Subject: UR: build/train/deploy once & querying for 3 use cases

Hello everybody,

We are experimenting with the Universal Recommender to provide
recommendations for the 3 distinct use cases below:

- Get a product recommendation based on product views
- Get a product recommendation based on product purchases
- Get a product recommendation based on previous purchases and views (i.e.
users who viewed this bought that)

The event server is fed from a single app with two types of events: "view"
and "purchase".

1. How should we customize the query to fetch results for each separate
case?
2. Is it feasible to build/train/deploy only once, and query for all 3 use
cases?


Best Regards,
Gerasimos


Users of Scala 2.11

2018-04-24 Thread Pat Ferrel
Hi all,

Mahout has hit a bit of a bump in releasing a Scala 2.11 version. I was
able to build 0.13.0 for Scala 2.11 and have published it on github as a
Maven compatible repo. I’m also using it from SBT.

If anyone wants access let me know.


Re: Info / resources for scaling PIO?

2018-04-24 Thread Pat Ferrel
PIO is based on the architecture of Spark, which uses HDFS; HBase also uses
HDFS. Scaling these is quite well documented on the web. Scaling PIO is
the same as scaling all of its services. It is unlikely you’ll need it, but
you can also have more than one PIO server behind a load balancer.

Don’t use local model storage; put models in HDFS. Don’t mess with NFS; it
is not the design point for PIO. Scaling Spark beyond one machine will
require HDFS anyway, so use it.
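For model storage in HDFS, the relevant pio-env.sh settings look roughly like the fragment below. This is a sketch from memory, so verify the variable names against your pio-env.sh.template; the repository name and HDFS path are illustrative.

```shell
# pio-env.sh fragment: keep trained models in HDFS instead of the local FS.
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://namenode:9000/models
```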

I also advise against using ES for all storage. Four things hit the event
storage: incoming events (input); training, where all events are read out
at high speed; optionally model storage (depending on the engine); and
queries, which usually hit the event storage. This will quickly overload one
service, and ES is not built as an object-retrieval DB. The only reason to
use ES for all storage is that it is convenient when doing development or
experimenting with engines. In production it would be risky to rely on ES
for all storage, and you would still need to scale out Spark and therefore
HDFS.

There is a little written about various scaling models at
http://actionml.com/docs/pio_by_actionml under the Architecture and Workflow
tab, and there are a couple of system install docs that cover scaling.


From: Adam Drew
Reply: user@predictionio.apache.org
Date: April 24, 2018 at 7:37:35 AM
To: user@predictionio.apache.org
Subject: Info / resources for scaling PIO?

Hi all!



Is there any info on how to scale PIO to multiple nodes? I’ve gone through
a lot of the docs on the site and haven’t found anything. I’ve tested PIO
running with HBase and ES for metadata and events, and with just ES for
both (my preference thus far), and have my models on local storage. Would
scaling simply be a matter of deploying clustered ES and then finding some
way to share my model storage, such as NFS or HDFS? The question then is
what (if anything) has to be done for the nodes to “know” about changes on
other nodes. For example, if the model gets trained on node A, does node B
automatically know about that?



I hope that makes sense. I’m coming to PIO with no prior experience for the
underlying apache bits (spark, hbase / hdfs, etc) so there’s likely things
I’m not considering. Any help / docs / guidance is appreciated.



Thanks!

Adam


Re: pio deploy without spark context

2018-04-14 Thread Pat Ferrel
The need for Spark at query time depends on the engine. Which are you
using? The Universal Recommender, which I maintain, does not require Spark
for queries but uses PIO; we simply don’t use the Spark context, so it is
ignored. To make PIO work you need to have the Spark code accessible, but
that doesn’t mean there must be a Spark cluster: you can set the Spark
master to “local” and no Spark resources are used in the deployed PIO
PredictionServer.
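A minimal sketch of that, assuming a UR-style engine that ignores the context at query time:

```shell
# Run the PredictionServer with an in-process Spark master so no cluster
# resources are consumed; the Spark context is created but stays local.
pio deploy -- --master local
```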

We have infra code to spin up a Spark cluster for training and bring it
back down afterward. This all works just fine. The UR PredictionServer also
has no need to be re-deployed, since the model is hot-swapped after
training. Deploy once, run forever. And there is no real requirement for
Spark to do queries.

So depending on the Engine the requirement for Spark is code level not
system level.


From: Donald Szeto
Reply: user@predictionio.apache.org
Date: April 13, 2018 at 4:48:15 PM
To: user@predictionio.apache.org
Subject: Re: pio deploy without spark context

Hi George,

This is unfortunately not possible now without modifying the source code,
but we are planning to refactor PredictionIO to be runtime-agnostic,
meaning the engine server would be independent and SparkContext would not
be created if not necessary.

We will start a discussion on the refactoring soon. You are very welcome to
add your input then, and any subsequent contribution would be highly
appreciated.

Regards,
Donald

On Fri, Apr 13, 2018 at 3:51 PM George Yarish wrote:

> Hi all,
>
> We use a pio engine which doesn't require Apache Spark at serving time,
> but from my understanding a SparkContext will be created by the "pio
> deploy" process by default anyway.
> My question: is there any way to deploy an engine while avoiding creation
> of a Spark application if I don't need it?
>
> Thanks,
> George
>
>


Re: Hbase issue

2018-04-13 Thread Pat Ferrel
This may seem unhelpful now, but for others it might be useful to mention
some minimum PIO-in-production best practices:

1) PIO should IMO never be run in production on a single node. When all
services share the same memory, CPU, and disk, it is very difficult to find
the root cause of a problem.
2) Back up data with pio export periodically.
3) Install monitoring for disk usage, as well as response times and other
factors, so you get warnings before you get wedged.
4) PIO will store data forever. It is designed as an input only system. Nothing 
is dropped ever. This is clearly unworkable in real life so a feature was added 
to trim the event stream in a safe way in PIO 0.12.0. There is a separate 
Template for trimming the DB and doing other things like deduplication and 
other compression on some schedule that can and should be different than 
training. Do not use this template until you upgrade and make sure it is 
compatible with your template: https://github.com/actionml/db-cleaner
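Point 2 above can be scripted and put on a cron schedule. A minimal sketch, assuming an app id of 1, a local backup directory, and a 30-day retention window (all assumptions to adjust for your setup):

```shell
# Hedged sketch: periodic event-store backup via `pio export`.
# APP_ID, BACKUP_DIR, and the retention window are assumptions.
APP_ID=1
BACKUP_DIR="${BACKUP_DIR:-$HOME/pio-backups}"
STAMP=$(date +%Y-%m-%d)
OUT="$BACKUP_DIR/events-app${APP_ID}-${STAMP}"
mkdir -p "$BACKUP_DIR"

# Run the export only where pio is installed
if command -v pio >/dev/null 2>&1; then
  pio export --appid "$APP_ID" --output "$OUT"
fi

# Prune backups older than 30 days
find "$BACKUP_DIR" -maxdepth 1 -name 'events-app*' -mtime +30 -exec rm -rf {} +
```

Wired into cron (e.g. nightly), this gives you restorable snapshots of the event stream, which matters because PIO otherwise keeps everything forever.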


From: bala vivek 
Reply: user@predictionio.apache.org 
Date: April 13, 2018 at 2:50:26 AM
To: user@predictionio.apache.org 
Subject:  Re: Hbase issue  

Hi Donald,

Yes, I'm running on a single machine. PIO, HBase, Elasticsearch, and Spark all 
run on the same server. Let me know which files I need to remove, because I 
have client data present in PIO.

I have tried adding the entries to hbase-site.xml using the following link, 
after which HMaster seems active, but the error remains the same.

https://medium.com/@tjosepraveen/cant-get-connection-to-zookeeper-keepererrorcode-connectionloss-for-hbase-63746fbcdbe7


HBase error logs (I have redacted the server name):

2018-04-13 04:31:28,246 INFO  [RS:0;VD500042:49584-SendThread(localhost:2182)] 
zookeeper.ClientCnxn: Opening socket connection to server 
localhost/0:0:0:0:0:0:0:1:2182. Will not attempt to authenticate using SASL 
(unknown error)
2018-04-13 04:31:28,247 WARN  [RS:0;XX:49584-SendThread(localhost:2182)] 
zookeeper.ClientCnxn: Session 0x162be5554b90003 for server null, unexpected 
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2018-04-13 04:31:28,553 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Master not initialized after 20ms seconds
        at 
org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:225)
        at 
org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:449)
        at 
org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:225)
        at 
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2436)

I have tried pio-stop-all and pio-start-all multiple times, but no luck; the 
service does not come up.
If I install HBase alone into the existing setup, let me know what I should 
consider. If anyone has faced this issue, please share the steps you used to 
solve it.

On Thu, Apr 12, 2018 at 9:13 PM, Donald Szeto  wrote:
Hi Bala,

Are you running a single-machine HBase setup? The ZooKeeper embedded in such a 
setup is pretty fragile when it comes to disk space issues, and your ZNode might 
have been corrupted.

If that’s indeed your setup, please take a look at the HBase log files, 
specifically at messages from ZooKeeper. In this situation, one way to recover 
is to remove the ZooKeeper files and let HBase recreate them, assuming from your 
log output that no other services depend on the same ZK.
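A minimal sketch of that recovery path for a single-machine setup with HBase's embedded ZooKeeper. The data directory below is the common default (`${hbase.tmp.dir}/zookeeper`), but it is an assumption: check `hbase.zookeeper.property.dataDir` in hbase-site.xml before deleting anything, and take a backup first.

```shell
# Hedged sketch: recover a corrupted embedded-ZooKeeper state by removing
# its files and letting HBase recreate them. ZK_DATA is an assumption --
# verify hbase.zookeeper.property.dataDir in hbase-site.xml first.
ZK_DATA="${ZK_DATA:-/tmp/hbase-$USER/zookeeper}"

if command -v pio-stop-all >/dev/null 2>&1; then pio-stop-all; fi

# Back up the ZooKeeper files before removing them
if [ -d "$ZK_DATA" ]; then
  tar czf "$HOME/zk-backup-$(date +%s).tar.gz" "$ZK_DATA"
  rm -rf "$ZK_DATA"
fi

if command -v pio-start-all >/dev/null 2>&1; then pio-start-all && pio status; fi
```

Only do this when, as Donald notes, no other services depend on the same ZooKeeper.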

Regards,
Donald

On Thu, Apr 12, 2018 at 5:34 AM bala vivek  wrote:
Hi,

I use PIO 0.10.0 and HBase 1.2.4. The setup was working fine until this 
morning. I saw PIO was down because of a disk space issue on the server, and I 
cleared the unwanted files.

After doing a pio-stop-all and pio-start-all, the HMaster service is not 
working. I tried restarting PIO multiple times.

I can see that whenever I do a pio-stop-all and check the services using jps, 
HMaster still seems to be running. I also tried running the ./start-hbase.sh 
script, but pio status still does not report success.

pio error log :

[INFO] [Console$] Inspecting PredictionIO...
[INFO] [Console$] PredictionIO 0.10.0-incubating is installed at 
/opt/tools/PredictionIO-0.10.0-incubating
[INFO] [Console$] Inspecting Apache Spark...
[INFO] [Console$] Apache Spark is installed at 
/opt/tools/PredictionIO-0.10.0-incubating/vendors/spark-1.6.3-b

Re: Data import, HBase requirements, and cost savings ?

2018-04-10 Thread Pat Ferrel
It depends on what templates you are using. For instance, the recommenders
require queries to the EventStore to get user history, so this will not work
for them. Some templates do not require Spark to be running at scale except
for the training phase (the Universal Recommender, for instance), so for that
template it is much more cost-effective to stop Spark when not using it.

Every template uses the PIO framework in different ways. Dropping the DB is
not likely to work, especially if you are using it to store engine metadata.

We’d need to know what templates you are using to advise cost savings.

From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: April 10, 2018 at 11:22:04 AM
To: user@predictionio.apache.org 

Subject:  Data import, HBase requirements, and cost savings ?

I'm exploring cost-saving options for a customer that wants to use
PredictionIO.  We plan on running multiple engines/templates.  We are
planning on running everything in AWS and are hoping not to have all data
loaded for all templates at once.  The hope is to:

   1. Start up the HBase cluster.
   2. Import the events.
   3. Train the model.
   4. Store the model in S3.
   5. Shut down the HBase cluster.
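The five steps above could be scripted roughly as follows. The bucket name, app id, and paths are placeholders, and each command is guarded so the sketch is a no-op where the CLIs are absent:

```shell
# Hedged sketch of the batch workflow above: pull archived events from S3,
# import them, and train. Bucket, APP_ID, and paths are placeholders.
APP_ID=1
EVENTS="/tmp/events.json"

if command -v aws >/dev/null 2>&1; then
  aws s3 cp "s3://my-pio-archive/events.json" "$EVENTS"   # fetch archived events
fi

if command -v pio >/dev/null 2>&1; then
  pio import --appid "$APP_ID" --input "$EVENTS"          # load the event store
  pio train                                               # build the model
fi
# HBase cluster start/stop would bracket this script in your orchestration.
```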

We have some general questions.

   1. Is this approach even feasible?
   2. Does PredictionIO require the Event Store (HBase) to be up and
   running constantly or can we turn it off when not training?  If it requires
   HBase constantly can we do the training from a different HBase cluster and
   then have separate PIO Event/Engine servers to deploy the applications
   using the model generated by the larger Hbase cluster?
   3. Can the events be stored in S3 and then imported in (pio import) when
   needed for training? or will we have to copy them out of S3 to our PIO
   Event/Engine server?
   4. Have any import benchmarks been done?  Events per second or MB/GB per
   second?

Any assistance would be appreciated.

--Cliff.


Re: how to set engine-variant in intellij idea

2018-04-10 Thread Pat Ferrel
There are instructions for using IntelliJ but, even though I wrote the last
version, I apologize that I can’t make them work anymore. If you get them to
work you would be doing the community a great service by telling us how, or by
editing the instructions.

http://predictionio.apache.org/resources/intellij/


From: qi zhang  
Reply: user@predictionio.apache.org 

Date: April 10, 2018 at 1:40:58 AM
To: user@predictionio.apache.org 

Subject:  how to set engine-variant in intellij idea

Hi all:
I ran into the following problem when deploying a model with IntelliJ IDEA.

What is engine-variant? Where can I get the value for this parameter? Could
you help with an example of how to set it?
Thanks! Thank you very much!




Re: Unclear problem with using S3 as a storage data source

2018-03-29 Thread Pat Ferrel
Ok, the problem, as I thought at first, is that Spark creates the model and the 
PredictionServer must read it.

My methods below still work. There is very little extra overhead, performance-wise, 
in creating a pseudo-cluster for HDFS if everything is still running on one machine.

You can also write the model to the local filesystem on the Spark/training machine 
and copy it to the PredictionServer before deploy. A simple scp in a script would 
do that.
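A hedged sketch of that scp script, assuming a reachable PredictionServer host and the default `.pio_store` location (both assumptions to adjust):

```shell
# Hedged sketch: after `pio train` on the training machine, copy the model
# stubs to the PredictionServer before `pio deploy`. Host and path are
# assumptions.
PREDICTION_HOST="${PREDICTION_HOST:-prediction-server}"
MODELS_DIR="$HOME/.pio_store/models"

# Copy only when a local models dir actually exists
if [ -d "$MODELS_DIR" ] && command -v scp >/dev/null 2>&1; then
  scp -r "$MODELS_DIR" "$PREDICTION_HOST:.pio_store/"
fi
```

Run this at the end of your training job, then `pio deploy` on the PredictionServer.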

Again I have no knowledge of using S3 for such things. If that works, someone 
else will have to help.




From: Dave Novelli 
Reply: user@predictionio.apache.org 
Date: March 29, 2018 at 6:19:58 AM
To: Pat Ferrel 
Cc: user@predictionio.apache.org 
Subject:  Re: Unclear problem with using S3 as a storage data source  

Sorry Pat, I think I took some shortcuts in my initial explanation that are 
causing some confusion :) I'll try laying everything out again in detail...

I have configured 2 servers in AWS:

Event/Prediction Server - t2.medium
- Runs permanently
- Using swap to deal with 4GB mem limit (I know, I know)
- ElasticSearch
- HBase (pseudo-distributed mode, using normal files instead of hdfs)
- Web server for events and 6 prediction models

Training Server - r4.large
- Only spun up to execute "pio train" for the 6 UR models I've configured then 
spun back down
- Spark

My specific problem is that running "pio train" on the training server when 
"LOCALFS" is set as the model data store will deposit all the stub files in 
.pio_store/models/.

When I run "pio deploy" on the Event/Prediction Server, it's looking for those 
files in the .pio_store/models/ directory on the Event/Prediction server, and 
they're obviously not there. If I manually copy the files from the Training 
server to the Event/Prediction server then "pio deploy" works as expected.

My thought is that if the Training server saves those model stub files to S3, 
then the Event/Prediction server can read those files from S3 and I won't have 
to manually copy them.


Hopefully this clears my situation up!


As a note - I realize t2.medium is not a feasible instance type for any 
significant production system, but I'm bootstrapping a demo system on a very 
tight budget for a site that will almost certainly have extremely low traffic. 
In my initial tests I've managed to get UR working on this configuration and 
will be doing some simple load testing soon to see how far I can push it before 
it crashes. Speed is obviously not an issue at the moment but once it is (and 
once there's some funding) that t2 will be replaced with an r4 or an m5.

Cheers,
Dave


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Wed, Mar 28, 2018 at 7:40 PM, Pat Ferrel  wrote:
Sorry then I don’t understand what part has no access to the file system on the 
single machine? 

Also a t2 is not going to work with PIO. Spark 2 alone requires something like 
2g for a do-nothing empty executor and driver, so a real app will require 16g 
or so minimum (my laptop has 16g). Running the OS, HBase, ES, and Spark will get 
you to over 8g before you even add data. Spark keeps all data needed at a given 
phase of the calculation in memory across the cluster; that’s where it gets its 
speed. Welcome to big data :-)


From: Dave Novelli 
Reply: user@predictionio.apache.org 
Date: March 28, 2018 at 3:47:35 PM
To: Pat Ferrel 
Cc: user@predictionio.apache.org 
Subject:  Re: Unclear problem with using S3 as a storage data source

I don't *think* I need more spark nodes - I'm just using the one for training 
on an r4.large instance I spin up and down as needed.

I was hoping to avoid adding any additional computational load to my 
Event/Prediction/HBase/ES server (all running on a t2.medium) so I am looking 
for a way to *not* install HDFS on there as well. S3 seemed like it would be a 
super convenient way to pass the model files back and forth, but it sounds like 
it wasn't implemented as a data source for the model repository for UR.

Perhaps that's something I could implement and contribute? I can *kinda* read 
Scala haha, maybe this would be a fun learning project. Do you think it would 
be fairly straightforward?


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Wed, Mar 28, 2018 at 6:01 PM, Pat Ferrel  wrote:
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node address 
even though all storage is on one machine. Then you use that version of HDFS to 
tell Spark where to look for the model. It gives the model a URI.

I have never used the raw S3 support, HDFS can also be backed by S3 but you use 
HDFS APIs, it is an HDFS config setting to use S3.

It is a rather unfortunate side effect of PIO but th

Re: Unclear problem with using S3 as a storage data source

2018-03-28 Thread Pat Ferrel
Sorry then I don’t understand what part has no access to the file system on the 
single machine? 

Also a t2 is not going to work with PIO. Spark 2 alone requires something like 
2g for a do-nothing empty executor and driver, so a real app will require 16g 
or so minimum (my laptop has 16g). Running the OS, HBase, ES, and Spark will get 
you to over 8g before you even add data. Spark keeps all data needed at a given 
phase of the calculation in memory across the cluster; that’s where it gets its 
speed. Welcome to big data :-)


From: Dave Novelli 
Reply: user@predictionio.apache.org 
Date: March 28, 2018 at 3:47:35 PM
To: Pat Ferrel 
Cc: user@predictionio.apache.org 
Subject:  Re: Unclear problem with using S3 as a storage data source  

I don't *think* I need more spark nodes - I'm just using the one for training 
on an r4.large instance I spin up and down as needed.

I was hoping to avoid adding any additional computational load to my 
Event/Prediction/HBase/ES server (all running on a t2.medium) so I am looking 
for a way to *not* install HDFS on there as well. S3 seemed like it would be a 
super convenient way to pass the model files back and forth, but it sounds like 
it wasn't implemented as a data source for the model repository for UR.

Perhaps that's something I could implement and contribute? I can *kinda* read 
Scala haha, maybe this would be a fun learning project. Do you think it would 
be fairly straightforward?


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Wed, Mar 28, 2018 at 6:01 PM, Pat Ferrel  wrote:
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node address 
even though all storage is on one machine. Then you use that version of HDFS to 
tell Spark where to look for the model. It gives the model a URI.

I have never used the raw S3 support, HDFS can also be backed by S3 but you use 
HDFS APIs, it is an HDFS config setting to use S3.
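For reference, backing HDFS APIs by S3 is done in Hadoop's core-site.xml via the s3a connector. A hedged sketch with a placeholder bucket and credentials; the property names are standard Hadoop s3a settings, but verify them against your Hadoop version:

```xml
<!-- Hedged sketch: core-site.xml entries for the s3a connector.
     Bucket name and credentials are placeholders. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-pio-bucket</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```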

It is a rather unfortunate side effect of PIO but there are 2 ways to solve 
this with no extra servers. 

Maybe someone else knows how to use S3 natively for the model stub?
 

From: Dave Novelli 
Date: March 28, 2018 at 12:13:12 PM
To: Pat Ferrel 
Cc: user@predictionio.apache.org 
Subject:  Re: Unclear problem with using S3 as a storage data source

Well, it looks like the local file system isn't an option in a multi-server 
configuration without manually setting up a process to transfer those stub 
model files.

I trained models on one heavy-weight temporary instance, and then when I went 
to deploy from the prediction server instance it failed due to missing files. I 
copied the .pio_store/models directory from the training server over to the 
prediction server and then was able to deploy.

So, in a dual-instance configuration what's the best way to store the files? 
I'm using pseudo-distributed HBase with standard file system storage instead of 
HDFS (my current aim is keeping down cost and complexity for a pilot project).

Is S3 back on the table as an option?

On Fri, Mar 23, 2018 at 11:03 AM, Dave Novelli  
wrote:
Ahhh ok, thanks Pat!


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Fri, Mar 23, 2018 at 8:08 AM, Pat Ferrel  wrote:
There is no need to have Universal Recommender models put in S3; they are not 
used and only exist (in stub form) because PIO requires them. The actual model 
lives in Elasticsearch and uses special features of ES to perform the last 
phase of the algorithm, and so cannot be replaced.

The stub PIO models have no data and will be tiny. Putting them in HDFS or the 
local file system is recommended.


From: Dave Novelli 
Reply: user@predictionio.apache.org 
Date: March 22, 2018 at 6:17:32 PM
To: user@predictionio.apache.org 
Subject:  Unclear problem with using S3 as a storage data source

Hi all,

I'm using the Universal Recommender template and I'm trying to switch storage 
data sources from local file to S3 for the model repository. I've read the page 
at https://predictionio.apache.org/system/anotherdatastore/ to try to 
understand the configuration requirements, but when I run pio train it's 
indicating an error and nothing shows up in the s3 bucket: 

[ERROR] [S3Models] Failed to insert a model to 
s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d

I created a new bucket named "pio-model" and granted full public permissions.

Seemingly relevant settings from pio-env.sh:

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
...

PIO_STORAGE_SOURCES_S3_TYPE=s3
PIO_STORAGE_SOURCES_S3_REGION=us-west-2
PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model

# I've tried with and without this
#PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com

# I'

Re: Unclear problem with using S3 as a storage data source

2018-03-28 Thread Pat Ferrel
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node
address even though all storage is on one machine. Then you use that
version of HDFS to tell Spark where to look for the model. It gives the
model a URI.

I have never used the raw S3 support, HDFS can also be backed by S3 but you
use HDFS APIs, it is an HDFS config setting to use S3.

It is a rather unfortunate side effect of PIO but there are 2 ways to solve
this with no extra servers.

Maybe someone else knows how to use S3 natively for the model stub?


From: Dave Novelli 

Date: March 28, 2018 at 12:13:12 PM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 

Subject:  Re: Unclear problem with using S3 as a storage data source

Well, it looks like the local file system isn't an option in a multi-server
configuration without manually setting up a process to transfer those stub
model files.

I trained models on one heavy-weight temporary instance, and then when I
went to deploy from the prediction server instance it failed due to missing
files. I copied the .pio_store/models directory from the training server
over to the prediction server and then was able to deploy.

So, in a dual-instance configuration what's the best way to store the
files? I'm using pseudo-distributed HBase with standard file system storage
instead of HDFS (my current aim is keeping down cost and complexity for a
pilot project).

Is S3 back on the table as an option?

On Fri, Mar 23, 2018 at 11:03 AM, Dave Novelli <
d...@ultravioletanalytics.com> wrote:

> Ahhh ok, thanks Pat!
>
>
> Dave Novelli
> Founder/Principal Consultant, Ultraviolet Analytics
> www.ultravioletanalytics.com | 919.210.0948 |
> d...@ultravioletanalytics.com
>
> On Fri, Mar 23, 2018 at 8:08 AM, Pat Ferrel  wrote:
>
>> There is no need to have Universal Recommender models put in S3; they are
>> not used and only exist (in stub form) because PIO requires them. The
>> actual model lives in Elasticsearch and uses special features of ES to
>> perform the last phase of the algorithm, and so cannot be replaced.
>>
>> The stub PIO models have no data and will be tiny. Putting them in HDFS
>> or the local file system is recommended.
>>
>>
>> From: Dave Novelli 
>> 
>> Reply: user@predictionio.apache.org 
>> 
>> Date: March 22, 2018 at 6:17:32 PM
>> To: user@predictionio.apache.org 
>> 
>> Subject:  Unclear problem with using S3 as a storage data source
>>
>> Hi all,
>>
>> I'm using the Universal Recommender template and I'm trying to switch
>> storage data sources from local file to S3 for the model repository. I've
>> read the page at https://predictionio.apache.org/system/anotherdatastore/
>> to try to understand the configuration requirements, but when I run pio
>> train it's indicating an error and nothing shows up in the s3 bucket:
>>
>> [ERROR] [S3Models] Failed to insert a model to
>> s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d
>>
>> I created a new bucket named "pio-model" and granted full public
>> permissions.
>>
>> Seemingly relevant settings from pio-env.sh:
>>
>> PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
>> PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
>> ...
>>
>> PIO_STORAGE_SOURCES_S3_TYPE=s3
>> PIO_STORAGE_SOURCES_S3_REGION=us-west-2
>> PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model
>>
>> # I've tried with and without this
>> #PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com
>>
>> # I've tried with and without this
>> #PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model
>>
>>
>> Any suggestions where I can start troubleshooting my configuration?
>>
>> Thanks,
>> Dave
>>
>>
>


--
Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com


Re: Error when training The Universal Recommender 0.7.0 with PredictionIO 0.12.0-incubating

2018-03-27 Thread Pat Ferrel
`pio build` requires that the ES hosts are known to Spark, which writes the
model to ES. You can pass these in on the `pio train` command line:

pio train … -- --conf spark.es.nodes="node1,node2,node3"

Notice there are no spaces in the quoted list of hosts; also notice the double
dash, which separates pio parameters from Spark parameters.

There is a way to pass this in using the sparkConf section in engine.json,
but this is unreliable due to how the commas are treated by ES. The site
description for the UR on the small HA cluster has not been updated for
0.7.0 because we are expecting a Mahout release, which will greatly simplify
the build process described in the README.


From: VI, Tran Tan Phong  
Reply: user@predictionio.apache.org 

Date: March 27, 2018 at 3:09:30 AM
To: user@predictionio.apache.org 

Subject:  Error when training The Universal Recommender 0.7.0 with
PredictionIO 0.12.0-incubating

Hi,



I am trying to build and train UR 0.7.0 with PredictionIO 0.12.0-incubating
on a local “Small HA Cluster” (http://actionml.com/docs/small_ha_cluster)
using Elasticsearch 5.5.2.

By following the different steps of the how-to, I succeeded in executing the
“pio build” command for UR 0.7.0. But I am getting some errors on the
following step, “pio train”.



Here are the principal errors:

…

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[ERROR] [NetworkClient] Node [127.0.0.1:9200] failed (Connection refused
(Connection refused)); no other nodes left - aborting...

…



Exception in thread "main"
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES
version - typically this happens if the network/Elasticsearch cluster is
not accessible or when targeting a WAN/Cloud instance without the proper
setting 'es.nodes.wan.only'

…

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[127.0.0.1:9200]]



The Elasticsearch cluster (aml-elasticsearch) is up, but it is not listening
on localhost.



Below is my config for ES 5.5.2:

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=aml-elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=aml-master,aml-slave-1,aml-slave-2

PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200

PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch



Has anybody seen this kind of error before? Any help or suggestion would be
appreciated.



Thanks,

VI Tran Tan Phong
This message contains information that may be privileged or confidential
and is the property of the Capgemini Group. It is intended only for the
person to whom it is addressed. If you are not the intended recipient, you
are not authorized to read, print, retain, copy, disseminate, distribute,
or use this message or any part thereof. If you receive this message in
error, please notify the sender immediately and delete all copies of this
message.


Re: Unclear problem with using S3 as a storage data source

2018-03-23 Thread Pat Ferrel
There is no need to have Universal Recommender models put in S3; they are
not used and only exist (in stub form) because PIO requires them. The
actual model lives in Elasticsearch and uses special features of ES to
perform the last phase of the algorithm, and so cannot be replaced.

The stub PIO models have no data and will be tiny. Putting them in HDFS or
the local file system is recommended.


From: Dave Novelli 

Reply: user@predictionio.apache.org 

Date: March 22, 2018 at 6:17:32 PM
To: user@predictionio.apache.org 

Subject:  Unclear problem with using S3 as a storage data source

Hi all,

I'm using the Universal Recommender template and I'm trying to switch
storage data sources from local file to S3 for the model repository. I've
read the page at https://predictionio.apache.org/system/anotherdatastore/
to try to understand the configuration requirements, but when I run pio
train it's indicating an error and nothing shows up in the s3 bucket:

[ERROR] [S3Models] Failed to insert a model to
s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d

I created a new bucket named "pio-model" and granted full public
permissions.

Seemingly relevant settings from pio-env.sh:

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
...

PIO_STORAGE_SOURCES_S3_TYPE=s3
PIO_STORAGE_SOURCES_S3_REGION=us-west-2
PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model

# I've tried with and without this
#PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com

# I've tried with and without this
#PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model


Any suggestions where I can start troubleshooting my configuration?

Thanks,
Dave


Re: UR 0.7.0 - problem with training

2018-03-08 Thread Pat Ferrel
BTW I think you may have to pass the setting on the CLI by adding "spark."
to the beginning of the key name:

pio train -- --conf spark.es.nodes="localhost" --driver-memory 8g
--executor-memory 8g


From: Pat Ferrel  
Reply: user@predictionio.apache.org 

Date: March 8, 2018 at 11:04:55 AM
To: Wojciech Kowalski  ,
user@predictionio.apache.org 
, u...@predictionio.incubator.apache.org


Subject:  Re: UR 0.7.0 - problem with training

es.nodes is supposed to be a string with hostnames separated by commas.
Depending on how your containers are set to communicate with the outside
world (Docker networking or port mapping) you may also need to set the
port, which is 9200 by default.

If your container is using port mapping and maps the container port 9200 to
the localhost port of 9200 you should be ok with only setting hostnames in
engine.json.

es.nodes="localhost"

But I suspect you didn’t set your container communication strategy because
this is the fallback that would have been tried with no valid setting.

If this is true look up how you set Docker to communicate, port mapping is
the simplest for a single all-in-one machine.
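A minimal sketch of that port-mapping setup, assuming Docker and the ES 5.x image used elsewhere in this thread (the image tag and flags are assumptions; the guards make this a no-op unless explicitly enabled):

```shell
# Hedged sketch: publish the container's 9200 on the host so engine.json
# can use es.nodes="localhost". Image tag is an assumption; set START_ES=1
# to actually launch the container.
ES_PORT=9200

if command -v docker >/dev/null 2>&1 && [ "${START_ES:-0}" = "1" ]; then
  docker run -d --name es \
    -p "$ES_PORT:$ES_PORT" -p 9300:9300 \
    docker.elastic.co/elasticsearch/elasticsearch:5.5.2
fi
```

With the port published, processes on the host reach ES at localhost:9200 exactly as the fallback expects.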


From: Wojciech Kowalski  
Reply: user@predictionio.apache.org 

Date: March 8, 2018 at 7:31:10 AM
To: u...@predictionio.incubator.apache.org


Subject:  UR 0.7.0 - problem with training

Hello, I am trying to set up the new UR 0.7.0 with PredictionIO 0.12.0, but
all attempts are failing.



I cannot set "es.config" in the engine’s spark config section, as I get this
error:

org.elasticsearch.index.mapper.MapperParsingException: object mapping for
[sparkConf.es.nodes] tried to parse field [es.nodes] as object, but found a
concrete value



If I don’t set this up, the engine fails to train because it cannot find
Elasticsearch on localhost, as it’s running on a separate machine:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[localhost:9200]]



Passing es.nodes via the CLI with --conf es.nodes=elasticsearch doesn’t help
either :/

pio train -- --conf es.nodes=elasticsearch --driver-memory 8g
--executor-memory 8g



Would anyone give any advice on what I am doing wrong?

I have separate Docker containers for Hadoop, HBase, Elasticsearch, and PIO.



The same setup was working fine on PIO 0.10 and UR 0.5.



Thanks,

Wojciech Kowalski


Re: UR 0.7.0 - problem with training

2018-03-08 Thread Pat Ferrel
es.nodes is supposed to be a string with hostnames separated by commas.
Depending on how your containers are set to communicate with the outside
world (Docker networking or port mapping) you may also need to set the
port, which is 9200 by default.

If your container is using port mapping and maps the container port 9200 to
the localhost port of 9200 you should be ok with only setting hostnames in
engine.json.

es.nodes="localhost"

But I suspect you didn’t set your container communication strategy because
this is the fallback that would have been tried with no valid setting.

If this is true look up how you set Docker to communicate, port mapping is
the simplest for a single all-in-one machine.


From: Wojciech Kowalski  
Reply: user@predictionio.apache.org 

Date: March 8, 2018 at 7:31:10 AM
To: u...@predictionio.incubator.apache.org


Subject:  UR 0.7.0 - problem with training

Hello, I am trying to set up the new UR 0.7.0 with PredictionIO 0.12.0, but
all attempts are failing.



I cannot set "es.config" in the engine’s spark config section, as I get this
error:

org.elasticsearch.index.mapper.MapperParsingException: object mapping for
[sparkConf.es.nodes] tried to parse field [es.nodes] as object, but found a
concrete value



If I don’t set this up, the engine fails to train because it cannot find
Elasticsearch on localhost, as it’s running on a separate machine:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[localhost:9200]]



Passing es.nodes via the CLI with --conf es.nodes=elasticsearch doesn’t help
either :/

pio train -- --conf es.nodes=elasticsearch --driver-memory 8g
--executor-memory 8g



Would anyone give any advice on what I am doing wrong?

I have separate Docker containers for Hadoop, HBase, Elasticsearch, and PIO.



The same setup was working fine on PIO 0.10 and UR 0.5.



Thanks,

Wojciech Kowalski


Re: Spark 2.x/scala 2.11.x release

2018-03-03 Thread Pat Ferrel
LGTM


From: Andrew Palumbo  
Reply: dev@mahout.apache.org  
Date: March 2, 2018 at 3:38:38 PM
To: dev@mahout.apache.org  
Subject:  Re: Spark 2.x/scala 2.11.x release

I've started a gdoc for a plan.


https://docs.google.com/document/d/1A8aqUORPp83vWa6fSqhC2jxUKEbDWWQ2lzqZ1V0xHS0/edit?usp=sharing


Please add comments, criticision, and alternate plans on on the doc.


--andy


From: Andrew Palumbo 
Sent: Friday, March 2, 2018 6:11:53 PM
To: dev@mahout.apache.org
Subject: Re: Spark 2.x/scala 2.11.x release

Pat, could you explain what you mean by the "Real Problem"? I know that we
have a lot of problems, but in terms of this release, what is the major
blocker?
____
From: Pat Ferrel 
Sent: Friday, March 2, 2018 5:32:58 PM
To: Trevor Grant; dev@mahout.apache.org
Subject: Re: Spark 2.x/scala 2.11.x release

Scopt is so not an issue. None whatsoever. The problem is that drivers have
unmet runtime needs that are different from libs. Scopt has absolutely
nothing to do with this. It came from a false theory that there was no 2.11
version, but scopt actually has 2.11, 2.12, 2.09, and, according to D, a
native version too.

Get on to the real problems and drop this non-problem. Anything that the
drivers need but is not on the classpath will stop them at runtime.

Better to say that we would be closer to release if we dropped drivers.


From: Trevor Grant  
Reply: dev@mahout.apache.org  

Date: March 2, 2018 at 2:26:13 PM
To: Mahout Dev List  
Subject: Re: Spark 2.x/scala 2.11.x release

Supposedly. I hard-coded all of the poms to Scala 2.11 (closed PR, unmerged).
Pat was still having issues w/ sbt, but the only dependency that was on 2.10
according to Maven was scopt. /shrug



On Mar 2, 2018 4:20 PM, "Andrew Palumbo"  wrote:

> So We could release as is if we can get the scopt issue out? Thats our
> final blocker?
>
> 
> From: Trevor Grant 
> Sent: Friday, March 2, 2018 5:15:35 PM
> To: Mahout Dev List
> Subject: Re: Spark 2.x/scala 2.11.x release
>
> The only "mess" is in the cli spark drivers, namely scopt.
>
> Get rid of the drivers/fix the scopt issue- we have no mess.
Re: Spark 2.x/scala 2.11.x release

2018-03-02 Thread Pat Ferrel
Scopt is so not an issue. None whatsoever. The problem is that drivers have
unmet runtime needs that are different from libs. Scopt has absolutely
nothing to do with this. It was from a false theory that there was no 2.11
version, but Scopt actually has 2.11, 2.12, 2.9, and, according to D, a
native version too.

Get on to the real problems and drop this non-problem. Anything a driver
needs but is not on the classpath will stop it at runtime.

Better to say that we would be closer to release if we dropped drivers.


From: Trevor Grant  
Reply: dev@mahout.apache.org  
Date: March 2, 2018 at 2:26:13 PM
To: Mahout Dev List  
Subject:  Re: Spark 2.x/scala 2.11.x release

Supposedly. I hard-coded all of the poms to Scala 2.11 (closed PR, unmerged).
Pat was still having issues w/ sbt, but the only dependency that was on 2.10
according to Maven was scopt. /shrug




Re: Spark 2.x/scala 2.11.x release

2018-03-02 Thread Pat Ferrel
BTW the mess master is in is why git flow was invented, and why I asked that
the site be in a new repo so it could be on a separate release cycle. We
perpetuate the mess because it’s always too hard to fix.


From: Andrew Palumbo  
Reply: dev@mahout.apache.org  
Date: March 2, 2018 at 1:54:51 PM
To: dev@mahout.apache.org  
Subject:  Re: Spark 2.x/scala 2.11.x release

re: reverting master, shit. I forgot that the website is not on `asf-site`
anymore. Well we could just re-jigger it, and check out `website` from
features/multi-artifact-build-MAHOUT-20xx after we revert the rest of
master.


You're right, Trevor - I'm just going through the commits, and there are
things like
https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374bf7494c3f3






(make Native Solvers Scala 2.11 compliant) and others peppered in,
post-0.13.0. It may still be possible, and not that hard, to cherry-pick
everything after 0.13.0 that we want. But I see what you're saying about it
not being completely simple.


As for Git-Flow, I don't really care. I use it in some projects and in
others I use GitHub-flow (basically what we've been doing with merging
everything to master).


Though this exact problem that we have right now is why git-flow is nice.
Let's separate the question of how we go forward, with what commit/repo
style, and first figure out how to back out what we have now, without
losing all of the work that you did on the multi-artifact build.


What do you think about reverting to 0.13.0, and cherry picking commits
like Sparse Speedup:
https://github.com/apache/mahout/commit/800a9ed6d7e015aa82b9eb7624bb441b71a8f397
or checking out entire folders like `website`?





From: Trevor Grant 
Sent: Friday, March 2, 2018 3:58:07 PM
To: Mahout Dev List
Subject: Re: Spark 2.x/scala 2.11.x release

If you revert master to the release tag you're going to destroy the
website.

The website pulls and rebuilds from master whenever Jenkins detects a
change.

mahout-0.13.0 has no website. So it will pull nothing and there will be no
site.

tg


On Fri, Mar 2, 2018 at 1:24 PM, Andrew Palumbo  wrote:

>
> Sounds good. I'll put out a proposal for the release, and we can go over
> it and vote, if we want to, on releasing or on the scope. I'm +1 on it.
>
>
> Broad strokes of what I'm thinking:
>
>
> - Checkout a new branch "features/multi-artifact-build-22xx" from master
> @ the `mahout-0.13.0` release tag.
>
>
> - Revert master back to release tag.
>
>
> - Checkout a new `develop` branch from master @ the `mahout-0.13.0` release
> tag.
>
>
> - Cherry-pick any commits that we'd like to release (e.g. SparseSpeedup)
> onto `develop` (along with a PR and a ticket).
>
>
> - Merge `develop` to `master`, run through Smoke tests, tag master @
> `mahout-0.13.1`(automatically), and release.
>
>
> This will also get us to more of a git-flow workflow, as we've discussed
> moving towards.
>
>
> Thoughts @all?
>
>
> --andy


Re: Spark 2.x/scala 2.11.x release

2018-03-02 Thread Pat Ferrel
Trevor said -1 to git flow.

I agree this is a good example of one nice feature. Master is always
pristine.


From: Andrew Palumbo  
Reply: dev@mahout.apache.org  
Date: March 2, 2018 at 12:30:09 PM
To: dev@mahout.apache.org  
Subject:  Re: Spark 2.x/scala 2.11.x release

Pat - Not sure if you thought I meant -1 or you're saying -1 to `git-flow`.
I'm +1 on it, and this is a perfect time to make the transition: since we
don't have a `develop` branch, we can revert `master` to its state at
the 0.13.0 release, create a `develop` branch from the `mahout-0.13.0` tag,
cherry-pick the commits that we want from post-`mahout-0.13.0`, and keep the
rest of the work currently on `master` (e.g. all of the work that Trevor did
on the multi-artifact release) on a feature branch, which we can hopefully
release soon as 0.13.2.


So I'm essentially suggesting that we take this as an opportunity to move
to this `git-flow` style of managing our repo.




____________
From: Pat Ferrel 
Sent: Friday, March 2, 2018 3:00:15 PM
To: Trevor Grant; dev@mahout.apache.org
Subject: Re: Spark 2.x/scala 2.11.x release

-1 on git flow?

What are your reasons? I use it in 3-4 projects, one of which is Apache PIO.
I’d think this would be a good example of why it’s nice. Right now our
master is screwed up; with git flow it would still have 0.13.0, which is
what we are talking about releasing with minor mods.


From: Trevor Grant  
Reply: dev@mahout.apache.org  

Date: March 2, 2018 at 11:36:52 AM
To: Mahout Dev List  
Subject: Re: Spark 2.x/scala 2.11.x release

I'm +1 for a working scala-2.11, spark-2.x build.

I'm -1 for previously stated reasons on git-flow.

tg




Re: Spark 2.x/scala 2.11.x release

2018-03-02 Thread Pat Ferrel
-1 on git flow?

What are your reasons? I use it in 3-4 projects, one of which is Apache PIO.
I’d think this would be a good example of why it’s nice. Right now our
master is screwed up; with git flow it would still have 0.13.0, which is
what we are talking about releasing with minor mods.


From: Trevor Grant  
Reply: dev@mahout.apache.org  
Date: March 2, 2018 at 11:36:52 AM
To: Mahout Dev List  
Subject:  Re: Spark 2.x/scala 2.11.x release

I'm +1 for a working scala-2.11, spark-2.x build.

I'm -1 for previously stated reasons on git-flow.

tg




Re: Spark 2.x/scala 2.11.x release

2018-02-28 Thread Pat Ferrel
big +1

If you are planning to branch off the 0.13.0 tag, let me know; I have a
speedup in my Scala 2.11 fork of 0.13.0 that needs to be released.


From: Andrew Palumbo  
Reply: dev@mahout.apache.org  
Date: February 28, 2018 at 11:16:12 AM
To: dev@mahout.apache.org  
Subject:  Spark 2.x/scala 2.11.x release

After some offline discussion regarding people's needs for Spark 2.x
and Scala 2.11.x, I am wondering if we should just consider a release for
2.x and 2.11.x as the default. We could release from the current master, or
branch back off of the 0.13.0 tag, and release that with the upgraded
defaults, and branch our current multi-artifact build off as a feature. Any
thoughts on this?


--andy


UR 0.7.0

2018-02-16 Thread Pat Ferrel
For any user of the Universal Recommender: in version 0.7.0, for PIO 0.12.0 and
Elasticsearch 5.x, the build.sbt had an error which caused a failure during `pio
build`. This is now fixed in the 0.7.0 tag and the master branch. If you have
already built Mahout from the ActionML fork, you should be able to pull the UR
repo and `pio build` will now work. Otherwise follow the instructions in the
README.md here: https://github.com/actionml/universal-recommender/tree/0.7.0


Sorry for the inconvenience.

Thanks to Dave Novelli for his persistence.

Re: Dynamically change parameter list

2018-02-15 Thread Pat Ferrel
There are several things to consider here. One is that the next time you train,
the metadata will be re-written from engine.json. This used to happen at `pio
build`, but I think it was moved to train. In any case, if you don’t need it as
input to training it should be part of the model, right? The model is read
during the predict phase and is always re-written by train.
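As a toy illustration of that point (names made up, not PIO APIs): values computed at training time that are only needed at predict time can travel inside the persisted model itself, so they survive regardless of what happens to engine.json:

```python
# Hypothetical sketch: keep calculated mappings inside the model artifact.
# Predict-time code then reads them from the model, not from engine.json.
import json

model = {
    "weights": [0.1, 0.2],
    "data_mapping": {"color=red": 0, "color=blue": 1},  # derived from training data
}

# Train serializes the model; predict deserializes it and gets the mapping back.
restored = json.loads(json.dumps(model))
print(restored["data_mapping"]["color=blue"])  # -> 1
```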

BTW I don’t use the PIO cross-validation stuff because it is too restrictive
for something that may be used for hyper-parameter search. I have external
Python that drives PIO and collects cross-validation results iteratively.
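Such an external driver can be as small as a loop over candidate parameter sets that scores each train/evaluate run and keeps the winner. A toy sketch (the parameter name and scoring function are invented; in practice each score would come from a real `pio train`/evaluation cycle):

```python
# Toy external tuning loop: try candidates, score each, keep the best.
def cross_validation_score(params):
    # Stand-in metric: pretend the optimum lies at ranking=0.7.
    return 1.0 - abs(params["ranking"] - 0.7)

candidates = [{"ranking": r} for r in (0.3, 0.5, 0.7, 0.9)]
best = max(candidates, key=cross_validation_score)
print(best)  # the winning params would feed the next training run
```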

On Feb 15, 2018, at 3:32 AM, Tihomir Lolić  wrote:

Hi Pat,

just wanted to follow up on this. I've modified CoreWorkflow to be able to
store algorithmParams in the engineInstance:

val engineInstances = Storage.getMetaDataEngineInstances
engineInstances.update(engineInstance.copy(
  status = "COMPLETED",
  endTime = DateTime.now,
  algorithmsParams =
    if (models(0).isInstanceOf[CustomCrossValidatorModel])
      JsonExtractor.paramsToJson(workflowConfig.jsonExtractor, algorithmParamsList)
    else
      engineInstance.algorithmsParams
))
Because I am using CrossValidator, I had to extend it with one additional
parameter which I wanted to save during training of the model.

I don't need this saved data during retraining, only during prediction. In
case I need it during retraining, I would modify TrainApp to fetch the data
before starting the train, which would solve the problem in the case of
reinforcement.

Hope this helps someone who needs such scenarios.

Best,
Tihomir


On Tue, Feb 13, 2018 at 12:35 AM, Pat Ferrel mailto:p...@occamsmachete.com>> wrote:
That would be fine since the model can contain anything. But the real question 
is where you want to use those params. If you need to use them the next time 
you train, you’ll have to persist them to a place read during training. That is 
usually only the metadata store (obviously input events too), which has the 
contents of engine.json. So to get them into the metadata store you may have to 
alter engine.json. 

Unless someone else knows how to alter the metadata directly after `pio train`

One problem is that you will never know what the new params are without putting 
them in a file or logging them. We keep them in a separate place and merge them 
with engine.json explicitly so we can see what is happening. They are 
calculated parameters, not hand made tunings. It seems important to me to keep 
those separate unless you are talking about some type of expected reinforcement 
learning, not really params but an evolving model.
 

On Feb 12, 2018, at 2:48 PM, Tihomir Lolić mailto:tihomir.lo...@gmail.com>> wrote:

Thank you very much for the answer. I'll try with customizing workflow. There 
is a step where Seq of models is returned. My idea is to return model and model 
parameters in this step. I'll let you know if it works.

Thanks,
Tihomie


Re: Dynamically change parameter list

2018-02-12 Thread Pat Ferrel
That would be fine since the model can contain anything. But the real question 
is where you want to use those params. If you need to use them the next time 
you train, you’ll have to persist them to a place read during training. That is 
usually only the metadata store (obviously input events too), which has the 
contents of engine.json. So to get them into the metadata store you may have to 
alter engine.json. 

Unless someone else knows how to alter the metadata directly after `pio train`

One problem is that you will never know what the new params are without putting 
them in a file or logging them. We keep them in a separate place and merge them 
with engine.json explicitly so we can see what is happening. They are 
calculated parameters, not hand made tunings. It seems important to me to keep 
those separate unless you are talking about some type of expected reinforcement 
learning, not really params but an evolving model.
 


Re: Dynamically change parameter list

2018-02-12 Thread Pat Ferrel
This is an interesting question. As we make more mature full featured
engines they will begin to employ hyper parameter search techniques or
reinforcement params. This means that there is a new stage in the workflow
or a feedback loop not already accounted for.

Short answer is no, unless you want to re-write your engine.json after
every train and probably keep the old one for safety. You must re-train to
get the new params put into the metastore and therefore available to your
engine.

What we do for the Universal Recommender is have a special new workflow
phase, call it a self-tuning phase, where we search for the right tuning of
parameters. This is done with code that runs outside of pio and creates
parameters that go into the engine.json. This can be done periodically to
make sure the tuning is still optimal.

Not sure whether feedback or hyper parameter search is the best
architecture for you.
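The merge step described above might look like this (the key names and engine.json layout are illustrative only): calculated values live in their own map and are overlaid onto the hand-edited engine.json defaults, so what the tuner changed stays visible:

```python
# Illustrative: overlay calculated params onto engine.json defaults.
import json

engine_json = {"algorithms": [{"params": {"ranking": 0.5, "maxQueryEvents": 100}}]}
tuned = {"ranking": 0.72}  # output of the self-tuning phase, kept separately

# dict(defaults, **overrides): tuned values win, untouched defaults survive.
merged = dict(engine_json["algorithms"][0]["params"], **tuned)
engine_json["algorithms"][0]["params"] = merged
print(json.dumps(engine_json))
```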


From: Tihomir Lolić  
Reply: user@predictionio.apache.org 

Date: February 12, 2018 at 2:02:48 PM
To: user@predictionio.apache.org 

Subject:  Dynamically change parameter list

Hi,

I am trying to figure out how to dynamically update the algorithm parameter
list. After the train is finished, only the model is updated. The reason why I
need this data to be updated is that I am creating a data mapping based on
the training data. Is there a way to update this data after the train is
done?

Here is the code that I am using. The variable that should be updated
after the train is marked *bold red.*

import io.prediction.controller.{EmptyParams, EngineParams}
import io.prediction.data.storage.EngineInstance
import io.prediction.workflow.CreateWorkflow.WorkflowConfig
import io.prediction.workflow._
import org.apache.spark.ml.linalg.SparseVector
import org.joda.time.DateTime
import org.json4s.JsonAST._

import scala.collection.mutable

object TrainApp extends App {

  val envs = Map("FOO" -> "BAR")

  val sparkEnv = Map("spark.master" -> "local")

  val sparkConf = Map("spark.executor.extraClassPath" -> ".")

  val engineFactoryName = "LogisticRegressionEngine"

  val workflowConfig = WorkflowConfig(
engineId = EngineConfig.engineId,
engineVersion = EngineConfig.engineVersion,
engineVariant = EngineConfig.engineVariantId,
engineFactory = engineFactoryName
  )

  val workflowParams = WorkflowParams(
verbose = workflowConfig.verbosity,
skipSanityCheck = workflowConfig.skipSanityCheck,
stopAfterRead = workflowConfig.stopAfterRead,
stopAfterPrepare = workflowConfig.stopAfterPrepare,
sparkEnv = WorkflowParams().sparkEnv ++ sparkEnv
  )

  WorkflowUtils.modifyLogging(workflowConfig.verbose)

  val dataSourceParams = DataSourceParams(sys.env.get("APP_NAME").get)
  val preparatorParams = EmptyParams()

  val algorithmParamsList = Seq("Logistic" -> LogisticParams(  // <-- update after train
    columns = Array[String](),
    dataMapping = Map[String, Map[String, SparseVector]]()))
  val servingParams = EmptyParams()

  val engineInstance = EngineInstance(
id = "",
status = "INIT",
startTime = DateTime.now,
endTime = DateTime.now,
engineId = workflowConfig.engineId,
engineVersion = workflowConfig.engineVersion,
engineVariant = workflowConfig.engineVariant,
engineFactory = workflowConfig.engineFactory,
batch = workflowConfig.batch,
env = envs,
sparkConf = sparkConf,
    dataSourceParams = JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
      workflowConfig.engineParamsKey -> dataSourceParams),
    preparatorParams = JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
      workflowConfig.engineParamsKey -> preparatorParams),
    algorithmsParams = JsonExtractor.paramsToJson(workflowConfig.jsonExtractor,
      algorithmParamsList),
    servingParams = JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
      workflowConfig.engineParamsKey -> servingParams)
  )

  val (engineLanguage, engineFactory) =
WorkflowUtils.getEngine(engineInstance.engineFactory,
getClass.getClassLoader)

  val engine = engineFactory()

  val engineParams = EngineParams(
dataSourceParams = dataSourceParams,
preparatorParams = preparatorParams,
algorithmParamsList = algorithmParamsList,
servingParams = servingParams
  )

  val engineInstanceId = CreateServer.engineInstances.insert(engineInstance)

  CoreWorkflow.runTrain(
env = envs,
params = workflowParams,
engine = engine,
engineParams = engineParams,
engineInstance = engineInstance.copy(id = engineInstanceId)
  )

  CreateServer.actorSystem.shutdown()
}


Thank you,
Tihomir


Re: pio train on Amazon EMR

2018-02-05 Thread Pat Ferrel
I agree, we looked at using EMR and found that we liked some custom Terraform + 
Docker much better. The existing EMR defined by AWS requires refactoring PIO or 
using it in yarn’s cluster mode. EMR is not meant to host any application code 
except what is sent into Spark in serialized form. However PIO expects to run 
the Spark “Driver” in the PIO process, which means on the PIO server machine. 

It is possible to make PIO use yarn’s cluster mode to serialize the “Driver” 
too, but this is fairly complicated. I think I’ve seen Donald explain it before, 
but we chose not to do this. For one thing, optimizing and tuning yarn-managed 
Spark changes the meaning of some tuning parameters.

Spark is moving to Kubernetes as a replacement for Yarn so we are quite 
interested in following that line of development.

One last thought on EMR: It was designed originally for Hadoop’s MapReduce. 
That meant that for a long time you couldn’t get big memory machines in EMR 
(you can now). So the EMR team in AWS does not seem to target Spark or other 
clustered services as well as they could. This is another reason we decided it 
wasn’t worth the trouble.


From: Mars Hall 
Reply: user@predictionio.apache.org 
Date: February 5, 2018 at 11:45:46 AM
To: user@predictionio.apache.org 
Subject:  Re: pio train on Amazon EMR  

Hi Malik,

This is a topic I've been investigating as well.

Given how EMR manages its clusters & their runtime, I don't think hacking 
configs to make the PredictionIO host act like a cluster member will be a 
simple or sustainable approach.

PredictionIO already operates Spark by building `spark-submit` commands.
  
https://github.com/apache/predictionio/blob/df406bf92463da4a79c8d84ec0ca439feaa0ec7f/tools/src/main/scala/org/apache/predictionio/tools/Runner.scala#L313

Implementing a new AWS EMR command runner in PredictionIO, so that we can 
switch `pio train` from the existing, plain `spark-submit` command to the 
AWS CLI's `aws emr add-steps --steps Args=spark-submit`, would likely solve a big 
part of this problem.
  https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html

Also, uploading the engine assembly JARs (the job code to run on Spark) to the 
cluster members or S3 for access from the EMR Spark runtime will be another 
part of this challenge.
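
A hedged sketch of what such a runner might emit, assuming an already-running EMR cluster and an engine assembly uploaded to S3 (the cluster id, S3 paths, and argument list are illustrative placeholders, not tested values):

```shell
# Submit a PIO training job as an EMR step instead of a local spark-submit.
# j-XXXXXXXX and the s3:// paths are placeholders; the Args list is passed
# through to spark-submit on the cluster, mirroring what PIO's Runner builds.
aws emr add-steps \
  --cluster-id j-XXXXXXXX \
  --steps 'Type=Spark,Name="pio train",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,org.apache.predictionio.workflow.CreateWorkflow,s3://my-bucket/engine-assembly.jar,--engine-variant,engine.json]'
```

The main class shown is the workflow entry point that PIO's own Runner invokes; the exact flags a working runner would need are an open question this thread leaves unresolved.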

On Mon, Feb 5, 2018 at 5:29 AM, Malik Twain  wrote:
I'm trying to run pio train with Amazon EMR. I copied core-site.xml and 
yarn-site.xml from EMR to my training machine, and configured HADOOP_CONF_DIR 
in pio-env.sh accordingly.

I'm running pio train as below:

pio train -- --master yarn --deploy-mode cluster

It's failing with the following errors:

18/02/05 11:56:15 INFO Client: 
   client token: N/A
   diagnostics: Application application_1517819705059_0007 failed 2 times due 
to AM Container for appattempt_1517819705059_0007_02 exited with exitCode: 1
Diagnostics: Exception from container-launch.

And below are the errors from EMR stdout and stderr respectively:

java.io.FileNotFoundException: /root/pio.log (Permission denied)
[ERROR] [CreateWorkflow$] Error reading from file: File 
file:/quickstartapp/MyExample/engine.json does not exist. Aborting workflow.

Thank you.



--
*Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California

Re: Frequent Pattern Mining - No engine found. Your build might have failed. Aborting.

2018-02-01 Thread Pat Ferrel
This list is for support of ActionML products, not general PIO support. You can 
get that on the Apache PIO user mailing list, where I have forwarded this 
question.

Several uses of FPM are supported by the Universal Recommender, such as 
Shopping cart recommendations. That is a template we support.


From: dee...@infosoftni.com 
Date: February 1, 2018 at 2:51:01 AM
To: actionml-user 
Subject:  Frequent Pattern Mining - No engine found. Your build might have 
failed. Aborting.  

I am using the Frequent Pattern Mining template and got the following error: No 
engine found.

Please advise.


s5@AMOL-PATIL:~/Documents/DataSheet/Templates/pio-template-fpm$ pio build 
--verbose
[INFO] [Engine$] Using command 
'/home/s5/Documents/DataSheet/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/sbt/sbt'
 at /home/s5/Documents/DataSheet/Templates/pio-template-fpm to build.
[INFO] [Engine$] If the path above is incorrect, this process will fail.
[INFO] [Engine$] Uber JAR disabled. Making sure 
lib/pio-assembly-0.12.0-incubating.jar is absent.
[INFO] [Engine$] Going to run: 
/home/s5/Documents/DataSheet/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/sbt/sbt
  package assemblyPackageDependency in 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm
[INFO] [Engine$] [info] Loading project definition from 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm/project
[INFO] [Engine$] [info] Set current project to pio-template-text-clustering (in 
build file:/home/s5/Documents/DataSheet/Templates/pio-template-fpm/)
[INFO] [Engine$] [success] Total time: 1 s, completed 1 Feb, 2018 4:13:41 PM
[INFO] [Engine$] [info] Including from cache: scala-library.jar
[INFO] [Engine$] [info] Checking every *.class/*.jar file's SHA-1.
[INFO] [Engine$] [info] Merging files...
[INFO] [Engine$] [warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[INFO] [Engine$] [warn] Strategy 'discard' was applied to a file
[INFO] [Engine$] [info] Assembly up to date: 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm/target/scala-2.10/pio-template-text-clustering-assembly-0.1-SNAPSHOT-deps.jar
[INFO] [Engine$] [success] Total time: 1 s, completed 1 Feb, 2018 4:13:42 PM
[INFO] [Engine$] Compilation finished successfully.
[INFO] [Engine$] Looking for an engine...
[ERROR] [Engine$] No engine found. Your build might have failed. Aborting.
s5@AMOL-PATIL:~/Documents/DataSheet/Templates/pio-template-fpm$


--
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/f193dd54-85a7-4598-88fe-fb7c74644f11%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Using Dataframe API vs. RDD API?

2018-01-30 Thread Pat Ferrel
What template are you using? If it is one of the templates in the Apache repos, 
you may want to file a bug report. If PIO supports Spark 2.x, the Apache 
Templates should also IMHO.


From: Daniel O' Shaughnessy 
Reply: user@predictionio.apache.org 
Date: January 30, 2018 at 9:09:49 AM
To: user@predictionio.apache.org 
Subject:  Re: Using Dataframe API vs. RDD API?  

Hi Shane,

You need to use PAlgorithm instead of P2Algorithm and save/load the spark 
context accordingly. This way you can use spark context in the predict function.

There are examples of using PAlgorithm on the predictionio Site. It’s slightly 
more complicated but not too bad!


On Tue, 30 Jan 2018 at 17:06, Shane Johnson  
wrote:
Thanks team! We are close to having our models working with the Dataframe API. 
One additional roadblock we are hitting is the fundamental difference between 
the RDD-based API and the Dataframe API. It seems the old mllib API would 
return predictions for a simple vector, whereas the new ml API requires a 
dataframe. This presents a challenge, as the predict function in PredictionIO 
does not have a spark context. 

Any ideas how to overcome this? Am I thinking through this correctly or are 
there other ways to get predictions with the new ml Dataframe API without 
having a dataframe as input?

Best,

Shane

Shane Johnson | 801.360.3350

LinkedIn | Facebook

2018-01-08 20:37 GMT-10:00 Donald Szeto :
We do have work-in-progress for DataFrame API tracked at 
https://issues.apache.org/jira/browse/PIO-71.

Chan, it would be nice if you could create a branch on your personal fork if 
you want to hand it off to someone else. Thanks!

On Fri, Jan 5, 2018 at 2:02 PM, Pat Ferrel  wrote:
Yes and I do not recommend that because the EventServer schema is not a 
developer contract. It may change at any time. Use the conversion method and go 
through the PIO API to get the RDD then convert to DF for now.

I’m not sure what PIO uses to get an RDD from Postgres, but if they do not use 
something like the lib you mention, a PR would be nice. Also, if you have an 
interest in adding the DF APIs to the EventServer, contributions are encouraged. 
Committers who know more than me on the subject will give some guidance, I’m 
sure.

If you want to donate some DF code, create a Jira and we’ll easily find a 
mentor to make suggestions. There are many benefits to this including not 
having to support a fork of PIO through subsequent versions. Also others are 
interested in this too.

 

On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy  
wrote:

Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to read in the 
RDD from a Postgres DB initially.

This way you don't need to use an EventServer!

On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy  
wrote:
Hi Shane, 

I've successfully used : 


import org.apache.spark.ml.classification.{
  RandomForestClassificationModel,
  RandomForestClassifier
}

with pio. You can access feature importance through the RandomForestClassifier 
also.

Very simple to convert RDDs to DFs as Pat mentioned, something like:


val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")




On Thu, 4 Jan 2018 at 23:10 Pat Ferrel  wrote:
Actually there are libs that will read DFs from HBase 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated into PIO. I think there is an existing Jira that 
requests Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel  wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that read out DFs though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson  wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{Random

Re: PIO error

2018-01-23 Thread Pat Ferrel
Unfortunately I can’t possibly guess without more information.

What do the logs say when pio cannot be started? Are all these pio
instances separate, not in a cluster? In other words, does each pio server
have all necessary services running on it? I assume none is sleeping like
a laptop does?

If you are worried: when properly configured, PIO is quite stable on servers
that do not sleep. I have never seen a bug that would cause this, and I have
installed it hundreds of times, so let's look through the logs and check your
pio-env.sh on a particular machine that is having this problem.


From: bala vivek  
Date: January 22, 2018 at 11:32:17 PM
To: actionml-user 

Subject:  Re: PIO error

Hi Pat,

PIO is installed on the Ubuntu server; the Dev and production servers are
hosted in other countries, and we connect through VPN from my laptop.
And yes, doing a pio-stop-all and pio-start-all always resolves the issue,
but this issue recurs often, and sometimes the PIO service does not come up
even after multiple PIO restarts.

Not sure of the core reason why the service often goes down.

Regards,
Bala

On Tuesday, January 23, 2018 at 2:47:26 AM UTC+5:30, pat wrote:
>
> If you are using a laptop for a dev machine, when it sleeps it can
> interfere with Zookeeper, which is started and used by HBase. So
> pio-stop-all then pio-start-all restarts HBase and therefor Zookeeper
> gracefully to solve this.
>
> Does the stop/start always solve this?
>
>
>
> From: bala vivek 
> Date: January 21, 2018 at 10:39:31 PM
> To: actionml-user 
> Subject:  PIO error
>
> Hi,
>
> I'm getting the following error in pio.
>
> pio status gives me the below result,
>
> [INFO] [Console$] Inspecting PredictionIO...
> [INFO] [Console$] PredictionIO 0.10.0-incubating is installed at
> /opt/tools/PredictionIO-0.10.0-incubating
> [INFO] [Console$] Inspecting Apache Spark...
> [INFO] [Console$] Apache Spark is installed at
> /opt/tools/PredictionIO-0.10.0-incubating/vendors/spark-1.
> 6.3-bin-hadoop2.6
> [INFO] [Console$] Apache Spark 1.6.3 detected (meets minimum requirement
> of 1.3.0)
> [INFO] [Console$] Inspecting storage backend connections...
> [INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
> [INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)...
> [INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
> [INFO] [Storage$] Test writing to Event Store (App Id 0)...
> [ERROR] [Console$] Unable to connect to all storage backends successfully.
> The following shows the error message from the storage backend.
> [ERROR] [Console$] Failed after attempts=1, exceptions:
> Mon Jan 22 01:00:02 EST 2018, org.apache.hadoop.hbase.
> client.RpcRetryingCaller@5c5d6175, org.apache.hadoop.hbase.ipc.
> RemoteWithExtrasException(org.apache.hadoop.hbase.PleaseHoldException):
> org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
>at org.apache.hadoop.hbase.master.HMaster.
> checkInitialized(HMaster.java:2293)
>at org.apache.hadoop.hbase.master.HMaster.checkNamespaceManagerReady(
> HMaster.java:2298)
>at org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(
> HMaster.java:2536)
>at org.apache.hadoop.hbase.master.MasterRpcServices.
> listNamespaceDescriptors(MasterRpcServices.java:1100)
>at org.apache.hadoop.hbase.protobuf.generated.
> MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:55734)
>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2180)
>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
>at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
> RpcExecutor.java:133)
>at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>at java.lang.Thread.run(Thread.java:748)
>
>  (org.apache.hadoop.hbase.client.RetriesExhaustedException)
> [ERROR] [Console$] Dumping configuration of initialized storage backend
> sources. Please make sure they are correct.
> [ERROR] [Console$] Source Name: ELASTICSEARCH; Type: elasticsearch;
> Configuration: TYPE -> elasticsearch, HOME -> /opt/tools/PredictionIO-0.10.
> 0-incubating/vendors/elasticsearch-1.7.3
> [ERROR] [Console$] Source Name: LOCALFS; Type: localfs; Configuration:
> PATH -> /root/.pio_store/models, TYPE -> localfs
> [ERROR] [Console$] Source Name: HBASE; Type: hbase; Configuration: TYPE ->
> hbase, HOME -> /opt/tools/PredictionIO-0.10.0-incubating/vendors/hbase-1.
> 2.4
>
>
> This setup is running in our production and this is not a new setup. Often
> I get this error and if do a pio-stop-all and pio-start-all, pio will work
> fine.
> But why does pio status often show an error? No new configuration changes
> were made in the pio-env.sh file
> --
> You received this message because you are subscribed to the Google Groups
> "actionml-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to actionml-use...@googlegroups.com.
> To post to this group, send email to action..

Re: Prediction IO install failed in Linux

2018-01-23 Thread Pat Ferrel
This would be very difficult to do. Even if you used a machine connected to
the internet to download things like pio, spark, etc. the very build tools
used (sbt) expect to be able to get code from various repositories on the
internet. To build templates would further complicate this since each
template may have different needs.

Perhaps you can take a laptop home, install and build, take it back to work
with all needed code installed. In order to use open source software it is
virtually impossible to work without access to the internet.


From: Praveen Prasannakumar 

Reply: user@predictionio.apache.org 

Date: January 23, 2018 at 7:03:27 AM
To: user@predictionio.apache.org 

Subject:  Re: Prediction IO install failed in Linux

Team - Is there a way to install predictio io offline ? If yes , Can
someone provide some documents for it ?

Thanks
Praveen

On Fri, Jan 19, 2018 at 11:05 AM, Praveen Prasannakumar <
praveen2399wo...@gmail.com> wrote:

> Hello Team
>
> I am trying to install prediction IO in one of our linux box with in
> office network. My company network have firewall and sometimes it wont
> connect to outside servers. I am not sure whether that is the reason on
> failure while executing make-distribution.sh script. Can you please help me
> to figure out how can I install prediction IO with in my office network ?
>
> Attaching the screenshot with error.
>
> ​
>
> Thanks
> Praveen
>




Re: Need Help Setting up prediction IO

2018-01-17 Thread Pat Ferrel
PIO uses Postgres, MySQL, or another JDBC database from the SQL DBs, or (and I 
always use this) HBase. HBase is a high-performance NoSQL DB that scales 
indefinitely.

It is possible to use any DB if you write an EventStore class for it, wrapping 
the DB calls with a virtualization API that is DB independent.

Memory is completely algorithm and data dependent, but expect PIO, which uses 
Spark, which in turn gets its speed from keeping data in memory, to use a lot 
compared to a web server. PIO apps are often in the big data category, and many 
deployments require Spark clusters with many GB per machine. It is rare to be 
able to run PIO in production on a single machine.
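
Since `pio train` forwards everything after the lone `--` to spark-submit (as noted earlier in this digest), memory can be tuned per deployment. A hedged example; the sizes and master URL are illustrative, not recommendations:

```shell
# Everything after "--" goes to spark-submit unchanged.
# 8g/16g and the master URL below are placeholder values for illustration.
pio train -- --master spark://spark-master:7077 \
  --driver-memory 8g --executor-memory 16g
```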

Welcome to big data.


On Jan 11, 2018, at 6:23 PM, Rajesh Jangid  wrote:

Hi, 
Well, with PIO 0.10 I think some dependency is causing trouble on 
Linux. We have figured out a way of using PIO for now, and everything is working 
great. 
  Thanks for the support though. 

A few questions:
1. Does the latest PIO support MongoDB or NoSQL?
2. Memory use by PIO: is there a max memory limit set, and if need be, can it 
be set? 


Thanks
Rajesh 


On Jan 11, 2018 10:25 PM, "Pat Ferrel" mailto:p...@occamsmachete.com>> wrote:
The version in the artifact built by Scala should only have the major version 
number so 2.10 or 2.11. PIO 0.10.0 needs 2.10.  Where, and what variable did 
you set to 2.10.4? That is the problem. There will never be a lib built for 
2.10.4, it will always be 2.10.



On Jan 11, 2018, at 5:15 AM, Daniel O' Shaughnessy mailto:danieljamesda...@gmail.com>> wrote:

Basically you need to make sure all your lib dependencies in build.sbt work 
together. 

On Thu, 11 Jan 2018 at 13:14 Daniel O' Shaughnessy mailto:danieljamesda...@gmail.com>> wrote:
Maybe try v2.10.4 based on this line:

[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4

I'm unfamiliar with the ubuntu setup for pio so can't help you there I'm afraid.
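
For the conflicting cross-version suffixes above, one way such conflicts are often resolved in sbt is to force a single shapeless artifact; a hedged sketch for build.sbt (the pinned version is illustrative and must match what the template's other dependencies were built against):

```scala
// build.sbt: force one shapeless artifact so the _2.10 and _2.10.4
// cross-built variants don't both end up on the classpath.
// The version below is a placeholder; pick the one your other deps expect.
dependencyOverrides += "com.chuusai" % "shapeless_2.10" % "2.0.0"
```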

On Thu, 11 Jan 2018 at 05:08 Rajesh Jangid mailto:raje...@grazitti.com>> wrote:
I am trying to run this on ubuntu 16.04

On Thu, Jan 11, 2018 at 10:36 AM, Rajesh Jangid mailto:raje...@grazitti.com>> wrote:
Hi, 
  I have tried once again with 2.10 as well but getting following dependency 
error

[INFO] [Console$] [error] Modules were resolved with conflicting cross-version 
suffixes in 
{file:/home/integration/client/PredictionIO-0.10/Engines/MyRecommendation/}myrecommendation:
[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4
[INFO] [Console$] java.lang.RuntimeException: Conflicting cross-version 
suffixes in: com.chuusai:shapeless
[INFO] [Console$] at scala.sys.package$.error(package.scala:27)
[INFO] [Console$] at 
sbt.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:46)
[INFO] [Console$] at sbt.ConflictWarning$.apply(ConflictWarning.scala:32)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1300)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1297)
[INFO] [Console$] at 
scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
[INFO] [Console$] at 
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
[INFO] [Console$] at sbt.std.Transform$$anon$4.work(System.scala:63)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
[INFO] [Console$] at sbt.Execute.work(Execute.scala:237)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
[INFO] [Console$] at 
sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[INFO] [Console$] at java.lang.Thread.run(Thread.java:745)
[INFO] [Console$] [error] (*:update) Conflicting cross-version suffixes in: 
com.chuusai:shapeless
[INFO] [Console$] [error] Total time: 6 s, completed Jan 11, 2018 5:03:51 AM
[ERROR] [Console$] Return code of previous step is 1. Aborting.


On Wed, Jan 10, 2018 at 10:03 PM, Daniel O' Shaughnessy 
mailto:danieljamesda...@gmail.com>> wrote:
I've pulled down this version without an

The Universal Recommender v0.7.0

2018-01-17 Thread Pat Ferrel
We have been waiting to release the UR v0.7.0 for testing (done) and the 
release of Mahout v0.13.1 (not done). Today we have released the UR v0.7.0 
anyway. This comes with:
Support for PIO v0.12.0
Requires Scala 2.11 (can be converted to use Scala 2.10 but it’s a manual 
process)
Requires Elasticsearch 5.X, and uses the REST client exclusively. This enables 
Elasticsearch authentication if needed.
Speed improvements for queries (ES 5.x is faster) and model building (a 
snapshot build of Mahout includes speedups)
Requires a source build of Mahout from a version forked by ActionML. This 
requirement will be removed as soon as Mahout releases v0.13.1, which will be 
incorporated in UR v0.7.1 asap. Follow special build instructions in the UR’s 
README.md.
Fixes a bug in the business rules for excluding items with certain properties

Report issues on the GitHub repo here: 
https://github.com/actionml/universal-recommender
Get tag v0.7.0 for `pio build` and be sure to read the instructions and 
warnings in the README.md there.

Ask questions on the Google Group here: 
https://groups.google.com/forum/#!forum/actionml-user
or on the PIO user list.

Re: Need Help Setting up prediction IO

2018-01-11 Thread Pat Ferrel
The version in the artifact built by Scala should only have the major version 
number so 2.10 or 2.11. PIO 0.10.0 needs 2.10.  Where, and what variable did 
you set to 2.10.4? That is the problem. There will never be a lib built for 
2.10.4, it will always be 2.10.



On Jan 11, 2018, at 5:15 AM, Daniel O' Shaughnessy  
wrote:

Basically you need to make sure all your lib dependencies in build.sbt work 
together. 

On Thu, 11 Jan 2018 at 13:14 Daniel O' Shaughnessy mailto:danieljamesda...@gmail.com>> wrote:
Maybe try v2.10.4 based on this line:

[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4

I'm unfamiliar with the ubuntu setup for pio so can't help you there I'm afraid.

On Thu, 11 Jan 2018 at 05:08 Rajesh Jangid mailto:raje...@grazitti.com>> wrote:
I am trying to run this on ubuntu 16.04

On Thu, Jan 11, 2018 at 10:36 AM, Rajesh Jangid mailto:raje...@grazitti.com>> wrote:
Hi, 
  I have tried once again with 2.10 as well but getting following dependency 
error

[INFO] [Console$] [error] Modules were resolved with conflicting cross-version 
suffixes in 
{file:/home/integration/client/PredictionIO-0.10/Engines/MyRecommendation/}myrecommendation:
[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4
[INFO] [Console$] java.lang.RuntimeException: Conflicting cross-version 
suffixes in: com.chuusai:shapeless
[INFO] [Console$] at scala.sys.package$.error(package.scala:27)
[INFO] [Console$] at 
sbt.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:46)
[INFO] [Console$] at sbt.ConflictWarning$.apply(ConflictWarning.scala:32)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1300)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1297)
[INFO] [Console$] at 
scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
[INFO] [Console$] at 
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
[INFO] [Console$] at sbt.std.Transform$$anon$4.work(System.scala:63)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
[INFO] [Console$] at sbt.Execute.work(Execute.scala:237)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
[INFO] [Console$] at 
sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[INFO] [Console$] at java.lang.Thread.run(Thread.java:745)
[INFO] [Console$] [error] (*:update) Conflicting cross-version suffixes in: 
com.chuusai:shapeless
[INFO] [Console$] [error] Total time: 6 s, completed Jan 11, 2018 5:03:51 AM
[ERROR] [Console$] Return code of previous step is 1. Aborting.


On Wed, Jan 10, 2018 at 10:03 PM, Daniel O' Shaughnessy 
mailto:danieljamesda...@gmail.com>> wrote:
I've pulled down this version without any modifications and run with pio v0.10 
on a mac and it builds with no issues.

However, when I add in scalaVersion := "2.11.8" to build.sbt I get a dependency 
error.

pio v0.10 supports scala 2.10 so you need to switch to this to run! 

On Wed, 10 Jan 2018 at 13:47 Rajesh Jangid mailto:raje...@grazitti.com>> wrote:
Yes, v0.5.0

On Jan 10, 2018 7:07 PM, "Daniel O' Shaughnessy" mailto:danieljamesda...@gmail.com>> wrote:
Is this the template you're using? 

https://github.com/apache/predictionio-template-ecom-recommender 


On Wed, 10 Jan 2018 at 13:16 Rajesh Jangid mailto:raje...@grazitti.com>> wrote:
Yes, 
We have dependency with elastic and we have elastic 1.4.4 already running. 
We Do not want to run another elastic instance.
Latest prediction IO does not support elastic 1.4.4


On Wed, Jan 10, 2018 at 6:25 PM, Daniel O' Shaughnessy 
mailto:danieljamesda...@gmail.com>> wrote:
Strange... do you absolutely need to run this with pio v0.10? 

On Wed, 10 Jan 2018 at 12:50 Rajesh Jangid mailto:raje...@grazitti.com>> wrote:
{"pio": {"version": { "min": "0.10.0-incubating" }}}


On Wed, Jan 10, 2018 at 6:16 PM, Daniel O' Shaughnessy 
mailto:danieljamesda...@gmail.com>> wrote:
OK that looks fine. What version is PredictionIO set to in temp

Re: Using Dataframe API vs. RDD API?

2018-01-05 Thread Pat Ferrel
Yes and I do not recommend that because the EventServer schema is not a 
developer contract. It may change at any time. Use the conversion method and go 
through the PIO API to get the RDD then convert to DF for now.

I’m not sure what PIO uses to get an RDD from Postgres, but if they do not use 
something like the lib you mention, a PR would be nice. Also, if you have an 
interest in adding the DF APIs to the EventServer, contributions are encouraged. 
Committers who know more than me on the subject will give some guidance, I’m 
sure.

If you want to donate some DF code, create a Jira and we’ll easily find a 
mentor to make suggestions. There are many benefits to this including not 
having to support a fork of PIO through subsequent versions. Also others are 
interested in this too.

 

On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy  
wrote:

Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to read in 
the RDD from a postgres DB initially.

This way you don't need to use an EventServer!

On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <danieljamesda...@gmail.com> wrote:
Hi Shane, 

I've successfully used : 

import org.apache.spark.ml.classification.{ RandomForestClassificationModel, 
RandomForestClassifier }

with pio. You can access feature importance through the RandomForestClassifier 
also.

Very simple to convert RDDs to DFs as Pat mentioned, something like:

val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")



On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <p...@occamsmachete.com> wrote:
Actually there are libs that will read DFs from HBase 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated into PIO. I think there is an existing Jira that 
requests Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that read out DFs though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com> wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE, or is it a much more involved development?

Thank You!
Shane Johnson | 801.360.3350 
LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook 
<https://www.facebook.com/shane.johnson.71653>




Re: Using Dataframe API vs. RDD API?

2018-01-04 Thread Pat Ferrel
Actually there are libs that will read DFs from HBase 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated into PIO. I think there is an existing Jira that 
requests Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel  wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that read out DFs though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com> wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE, or is it a much more involved development?

Thank You!
Shane Johnson | 801.360.3350

LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook 
<https://www.facebook.com/shane.johnson.71653>



Re: Using Dataframe API vs. RDD API?

2018-01-04 Thread Pat Ferrel
Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that read out DFs though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson  wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE, or is it a much more involved development?

Thank You!
Shane Johnson | 801.360.3350

LinkedIn  | Facebook 



Re: Error: "unable to undeploy"

2018-01-03 Thread Pat Ferrel
The UR does not require more than one deploy (assuming the server runs 
forever). Retraining the UR automatically re-deploys the new model. 

All other Engines afaik do require retrain-redeploy.

Users should be aware that PIO is a framework that provides no ML function 
whatsoever. It supports a workflow but Engines are free to simplify or use it 
in different ways so always preface a question with what Engine you are using 
or asking about.



On Jan 3, 2018, at 4:33 AM, Noelia Osés Fernández  wrote:

Hi lokotochek,

You mentioned that it wasn't necessary to redeploy after retraining. However, 
today I have come across a PIO webpage that I hadn't seen before that tells me 
to redeploy after retraining (section 'Update Model with New Data'):

http://predictionio.incubator.apache.org/deploy/ 


Particularly, this page suggests adding the following line to the crontab to 
retrain every day:

0 0 * * *   $PIO_HOME/bin/pio train; $PIO_HOME/bin/pio deploy


Here it is clear that it is redeploying after retraining. So does it not 
actually hot-swap the model? Or does the UR hot-swap, while this page is more 
general, for other templates that might not?

Thanks for your help!



On 14 December 2017 at 15:57, Александр Лактионов <lokotoc...@gmail.com> wrote:
Hi Noelia,
you don't have to redeploy your app after training. It will be hot-swapped and the 
previous process (run by pio deploy) will serve the new recommendations automatically.
> On Dec 14, 2017, at 17:56, Noelia Osés Fernández wrote:
> 
> Hi,
> 
> The first time after reboot that I train and deploy my PIO app everything 
> works well. However, if I then retrain and deploy again, I get the following 
> error: 
> 
> [INFO] [MasterActor] Undeploying any existing engine instance at http://0.0.0.0:8000
> [ERROR] [MasterActor] Another process might be occupying 0.0.0.0:8000. Unable to undeploy.
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [MasterActor] Bind failed. Retrying... (2 more trial(s))
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [ERROR] [MasterActor] Bind failed. Retrying... (1 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [MasterActor] Bind failed. Retrying... (0 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [MasterActor] Bind failed. Shutting down.
> 
> I thought it was possible to retrain an app that was running and then deploy 
> again.
> Is this not possible?
> 
> How can I kill the running instance?
> I've tried the trick in handmade's integration test but it doesn't work:
> 
> deploy_pid=`jps -lm | grep "onsole deploy" | cut -f 1 -d ' '`
> echo "Killing the deployed test PredictionServer"
> kill "$deploy_pid"
> 
> I still get the same error after doing this.
> 
> Any help is much appreciated.
> Best regards,
> Noelia
> 
> 
> 
> 
> 
> 
> 





Re: App still returns results after pio app data-delete

2018-01-02 Thread Pat Ferrel
BTW there is a new Chrome extension that lets you browse ES and create any JSON 
query. Just found it myself after Sense stopped working in Chrome. Try 
ElasticSearch Head, found in the Chrome store.


On Jan 2, 2018, at 9:53 AM, Pat Ferrel  wrote:

Have a look at the ES docs on their site. There are several ways, from sending 
a JSON command to deleting the data directory depending on how clean you want 
ES to be.

In general my opinion is that PIO is an integration framework for several 
services and for the majority of applications you will not need to deal 
directly with the services except for setup. This may be an exception. In all 
cases you may find it best to seek guidance from the support communities or 
docs of those services.

If you are sending a REST JSON command directive it would be as shown here: 
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html

$ curl -XDELETE 'http://localhost:9200/<index_name>/'

The index name is set in the UR engine.json or in pio-env, depending on which 
index you want to delete.
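For completeness, the same delete can be issued from Python's standard library instead of curl. A minimal sketch: the host and the index name `ur_index` are placeholders for your own setup, and the actual send is left commented out since deleting an index is destructive.

```python
import urllib.request

def delete_index_request(host, index):
    """Build (but do not send) a DELETE request for an Elasticsearch index."""
    return urllib.request.Request(
        "http://%s/%s" % (host, index), method="DELETE")

req = delete_index_request("localhost:9200", "ur_index")
# urllib.request.urlopen(req)  # uncomment to actually delete the index
```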


On Jan 2, 2018, at 12:22 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks for the explanation!

How do I delete the ES index? is it just DELETE /my_index_name?

Happy New Year!

On 22 December 2017 at 19:42, Pat Ferrel <p...@occamsmachete.com> wrote:
With PIO the model is managed by the user, not PIO. The input is separate and 
can be deleted without affecting the model.

Each Engine handles models its own way but most use the model storage in 
pio-env. So deleting those will get rid of the model. The UR keeps the model in 
ES under the “indexName” and “typeName” in engine.json. So you need to delete 
the index if you want to stop queries from working. The UR maintains one live 
copy of the model and removes old ones after a new one is made live, so there 
will only ever be one model (unless you have changed your indexName often).


On Dec 21, 2017, at 4:58 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Hi all!

I have executed a pio app data-delete MyApp. The command has outputted the 
following:

[INFO] [HBLEvents] Removing table pio_event:events_4...
[INFO] [App$] Removed Event Store for the app ID: 4
[INFO] [HBLEvents] The table pio_event:events_4 doesn't exist yet. Creating 
now...
[INFO] [App$] Initialized Event Store for the app ID: 4


However, I executed

curl -H "Content-Type: application/json" -d '
{
}' http://localhost:8000/queries.json

after deleting the data and I still get the same results as before deleting the 
data. Why is this happening?

I expected to get either an error message or an empty result like 
{"itemScores":[]}.

Any help is much appreciated.
Best regards,
Noelia




-- 
 <http://www.vicomtech.org/>

Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior

no...@vicomtech.org <mailto:no...@vicomtech.org>
+[34] 943 30 92 30
Data Intelligence for Energy and Industrial Processes | Inteligencia de Datos 
para Energía y Procesos Industriales

 <https://www.linkedin.com/company/vicomtech>  
<https://www.youtube.com/user/VICOMTech>  <https://twitter.com/@Vicomtech_IK4>

member of:  <http://www.graphicsmedia.net/>

Legal Notice - Privacy policy <http://www.vicomtech.org/en/proteccion-datos>



Re: App still returns results after pio app data-delete

2018-01-02 Thread Pat Ferrel
Have a look at the ES docs on their site. There are several ways, from sending 
a JSON command to deleting the data directory depending on how clean you want 
ES to be.

In general my opinion is that PIO is an integration framework for several 
services and for the majority of applications you will not need to deal 
directly with the services except for setup. This may be an exception. In all 
cases you may find it best to seek guidance from the support communities or 
docs of those services.

If you are sending a REST JSON command directive it would be as shown here: 
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html

$ curl -XDELETE 'http://localhost:9200/<index_name>/'

The index name is set in the UR engine.json or in pio-env, depending on which 
index you want to delete.


On Jan 2, 2018, at 12:22 AM, Noelia Osés Fernández  wrote:

Thanks for the explanation!

How do I delete the ES index? is it just DELETE /my_index_name?

Happy New Year!

On 22 December 2017 at 19:42, Pat Ferrel <p...@occamsmachete.com> wrote:
With PIO the model is managed by the user, not PIO. The input is separate and 
can be deleted without affecting the model.

Each Engine handles models its own way but most use the model storage in 
pio-env. So deleting those will get rid of the model. The UR keeps the model in 
ES under the “indexName” and “typeName” in engine.json. So you need to delete 
the index if you want to stop queries from working. The UR maintains one live 
copy of the model and removes old ones after a new one is made live, so there 
will only ever be one model (unless you have changed your indexName often).


On Dec 21, 2017, at 4:58 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Hi all!

I have executed a pio app data-delete MyApp. The command has outputted the 
following:

[INFO] [HBLEvents] Removing table pio_event:events_4...
[INFO] [App$] Removed Event Store for the app ID: 4
[INFO] [HBLEvents] The table pio_event:events_4 doesn't exist yet. Creating 
now...
[INFO] [App$] Initialized Event Store for the app ID: 4


However, I executed

curl -H "Content-Type: application/json" -d '
{
}' http://localhost:8000/queries.json

after deleting the data and I still get the same results as before deleting the 
data. Why is this happening?

I expected to get either an error message or an empty result like 
{"itemScores":[]}.

Any help is much appreciated.
Best regards,
Noelia






Re: Recommendation return score more than 5

2017-12-22 Thread Pat Ferrel
I did not write the template you are using. I am trying to explain what the 
template should be doing and how ALS works. I’m sure that with exactly the same 
data you should get the same results but in real life you will need to 
understand the algorithm a little deeper, hence the pointer to the Spark MLlib 
code that the template executes. If this is not helpful please ignore the advice.


On Dec 22, 2017, at 11:16 AM, GMAIL  wrote:

But I strictly followed the instructions from the site and did not change 
anything. Everything I did was steps from this page. I did not perform any 
additional operations, including editing the source code.

Instruction (Quick Start - Recommendation Engine Template): 
http://predictionio.incubator.apache.org/templates/recommendation/quickstart/

2017-12-22 22:12 GMT+03:00 Pat Ferrel <p...@occamsmachete.com>:
Implicit means you assign a score to the event based on your own guess. 
Explicit uses ratings the user makes. One score is a guess by you (like a 4 for 
buy) and the other is a rating made by the user. ALS comes in 2 flavors, one 
for explicit scoring, used to predict rating and the other for implicit scoring 
used to predict something the user will prefer. 

Make sure your template is using the explicit version of ALS. 
https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback


On Dec 21, 2017, at 11:09 PM, GMAIL <babaevka...@gmail.com> wrote:

I wanted to use the Recommender because I expected that it could predict scores 
the way MovieLens does. And it seems to do so, but for some reason the input and 
output scales are different: the imported scores run from 1 to 5, and the 
predicted ones from 1 to 10.

If by implicit scores you mean events without parameters, then I am aware that 
in essence there is also a score. I looked at the DataSource in the Recommender 
and there were only two events: rate and buy. Rate takes a score, and buy 
implicitly sets the rating to 4 (out of 5, I think).

And I still have not understood exactly where to look and what to correct so 
that the incoming and predicted scores are on the same scale.

2017-12-19 4:10 GMT+03:00 Pat Ferrel <p...@occamsmachete.com>:
There are 2 types of MLlib ALS recommenders last I checked, implicit and 
explicit. Implicit ones you give any arbitrary score, like a 1 for purchase. 
The explicit one you can input ratings and it is expected to predict ratings 
for an individual. But both iirc also have a regularization parameter that 
affects the scoring and is a param so you have to experiment with it using 
cross-validation to see where you get the best results.

There is an old metric used for this type of thing called RMSE 
(root-mean-square error) which, when minimized will give you scores that most 
closely match actual scores in the hold-out set (see wikipedia on 
cross-validation and RMSE). You may have to use explicit ALS and tweak the 
regularization param, to get the lowest RMSE. I doubt anything will guarantee 
them to be in exactly the range of ratings so you’ll then need to pick the 
closest rating.


On Dec 18, 2017, at 10:42 AM, GMAIL <babaevka...@gmail.com> wrote:

That is, the predicted scores that the Recommender returns cannot just be 
multiplied by two, but may be completely wrong? 
I cannot, say, just divide the predictions by 2 and pretend that everything is 
fine?

2017-12-18 21:35 GMT+03:00 Pat Ferrel <p...@occamsmachete.com>:
The UR and the Recommendations Template use very different technology 
underneath. 

In general the scores you get from recommenders are meaningless on their own. 
When using ratings as numerical values with a ”Matrix Factorization” 
recommender like the ones in MLlib, upon which the Recommendations Template is 
based, you need a regularization parameter. I don’t know for sure but maybe 
this is why the results don’t come in the range of input ratings. I haven’t 
looked at the code in a long while.

If you are asking about the UR it would not take numeric ratings and the scores 
cannot be compared to them.

For many reasons that I have written about before I always warn people about 
using ratings, which have been discontinued as a source of input for Netflix 
(who have removed them from their UX) and many other top recommender users. 
There are many reasons for this, not the least of which is that they are 
ambiguous and don’t directly relate to whether a user might like an item. For 
instance most video sources now use something like the length of time a user 
watches a video, and review sites prefer “like” and “dislike”. The first is 
implicit and the second is quite unambiguous. 


On Dec 18, 2017, at 12:32 AM, GMAIL m

Re: How to import item properties dynamically?

2017-12-22 Thread Pat Ferrel
The properties go into the Event Store immediately but you have to train to get 
them into the model, assuming your template supports item properties. If you 
are using the UR, the properties will not get into the model until the next 
`pio train…`
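On the error in the quoted message below: the traceback shows `properties` arriving as a JSON *string*, because `json.dumps` is applied to hand-built text that is already JSON. Passing a plain dict and letting the SDK serialize it avoids this. A hedged sketch (the field names mirror the quoted code, but `plan` is modeled here as a dict rather than the original record object, and `create_event` is the PredictionIO Python SDK call shown in the quote):

```python
def build_properties(plan, dfcolumns):
    """Collect only the properties present in this batch into a plain dict."""
    properties = {}
    if "tiempo" in dfcolumns:
        properties["tiempo"] = int(plan["tiempo"])
    if "duracion" in dfcolumns:
        properties["duracion"] = plan["duracion"]
    return properties

props = build_properties({"tiempo": 2.0, "duracion": 60}, ["tiempo", "duracion"])
# Pass the dict directly -- no json.dumps():
# client.create_event(event="$set", entity_type="item",
#                     entity_id=plan["id_product"], properties=props)
```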


On Dec 22, 2017, at 3:37 AM, Noelia Osés Fernández  wrote:


Hi all,

I have a pio app and I need to update item properties regularly. However, not 
all items will have all properties always. So I want to update the properties 
dynamically doing something similiar to the following:

# create properties json
propertiesjson = '{'
if "tiempo" in dfcolumns:
    propertiesjson = propertiesjson + '"tiempo": ' + str(int(plan.tiempo))
if "duracion" in dfcolumns:
    propertiesjson = propertiesjson + ', "duracion": ' + str(plan.duracion)
propertiesjson = propertiesjson + '}'

# add event
client.create_event(
    event="$set",
    entity_type="item",
    entity_id=plan.id_product,
    properties=json.dumps(propertiesjson)
)


However, this results in an error message:


Traceback (most recent call last):
  File "import_itemproperties.py", line 110, in 
import_events(client, args.dbuser, args.dbpasswd, args.dbhost, args.dbname)
  File "import_itemproperties.py", line 73, in import_events
properties=json.dumps(propertiesjson)
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/__init__.py", 
line 255, in create_event
event_time).get_response()
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/connection.py", 
line 111, in get_response
self._response = self.rfunc(tmp_response)
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/__init__.py", 
line 130, in _acreate_resp
response.body))
predictionio.NotCreatedError: request: POST 
/events.json?accessKey=0Hys1qwfgo3vF16jElBDJJnSLmrkN5Tg86qAPqepYPK_-lXMqI4NMjLXaBGgQJ4U
 {'entityId': 8, 'entityType': 'item', 'properties': '"{\\"tiempo\\": 2, 
\\"duracion\\": 60}"', 'event': '$set', 'eventTime': 
'2017-12-22T11:29:59.762+'} 
/events.json?accessKey=0Hys1qwfgo3vF16jElBDJJnSLmrkN5Tg86qAPqepYPK_-lXMqI4NMjLXaBGgQJ4U?entityId=8&entityType=item&properties=%22%7B%5C%22tiempo%5C%22%3A+2%2C+%5C%22duracion%5C%22%3A+60%2C&event=%24set&eventTime=2017-12-22T11%3A29%3A59.762%2B
 status: 400 body: {"message":"org.json4s.package$MappingException: Expected 
object but got JString(\"{\\\"tiempo\\\": 2, \\\"duracion\\\": 60}\")"}


Any help is much appreciated!
Season's greetings!
Noelia






Re: Recommendation return score more than 5

2017-12-22 Thread Pat Ferrel
Implicit means you assign a score to the event based on your own guess. 
Explicit uses ratings the user makes. One score is a guess by you (like a 4 for 
buy) and the other is a rating made by the user. ALS comes in 2 flavors, one 
for explicit scoring, used to predict rating and the other for implicit scoring 
used to predict something the user will prefer. 
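To make the distinction concrete, a toy scoring sketch. The event names and the 4-for-buy weight follow the convention described in this thread's quoted template discussion, not a universal rule:

```python
def event_score(event, rating=None):
    """Turn an event into a training score: explicit for rate, implicit for buy."""
    if event == "rate":
        return float(rating)   # explicit: the user's own rating
    if event == "buy":
        return 4.0             # implicit: a weight guessed by the developer
    raise ValueError("unhandled event: %s" % event)

scores = [event_score("rate", 5), event_score("buy")]
```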

Make sure your template is using the explicit version of ALS. 
https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback

On Dec 21, 2017, at 11:09 PM, GMAIL  wrote:

I wanted to use the Recommender because I expected that it could predict scores 
the way MovieLens does. And it seems to do so, but for some reason the input and 
output scales are different: the imported scores run from 1 to 5, and the 
predicted ones from 1 to 10.

If by implicit scores you mean events without parameters, then I am aware that 
in essence there is also a score. I looked at the DataSource in the Recommender 
and there were only two events: rate and buy. Rate takes a score, and buy 
implicitly sets the rating to 4 (out of 5, I think).

And I still have not understood exactly where to look and what to correct so 
that the incoming and predicted scores are on the same scale.

2017-12-19 4:10 GMT+03:00 Pat Ferrel <p...@occamsmachete.com>:
There are 2 types of MLlib ALS recommenders last I checked, implicit and 
explicit. Implicit ones you give any arbitrary score, like a 1 for purchase. 
The explicit one you can input ratings and it is expected to predict ratings 
for an individual. But both iirc also have a regularization parameter that 
affects the scoring and is a param so you have to experiment with it using 
cross-validation to see where you get the best results.

There is an old metric used for this type of thing called RMSE 
(root-mean-square error) which, when minimized will give you scores that most 
closely match actual scores in the hold-out set (see wikipedia on 
cross-validation and RMSE). You may have to use explicit ALS and tweak the 
regularization param, to get the lowest RMSE. I doubt anything will guarantee 
them to be in exactly the range of ratings so you’ll then need to pick the 
closest rating.


On Dec 18, 2017, at 10:42 AM, GMAIL <babaevka...@gmail.com> wrote:

That is, the predicted scores that the Recommender returns cannot just be 
multiplied by two, but may be completely wrong? 
I cannot, say, just divide the predictions by 2 and pretend that everything is 
fine?

2017-12-18 21:35 GMT+03:00 Pat Ferrel <p...@occamsmachete.com>:
The UR and the Recommendations Template use very different technology 
underneath. 

In general the scores you get from recommenders are meaningless on their own. 
When using ratings as numerical values with a ”Matrix Factorization” 
recommender like the ones in MLlib, upon which the Recommendations Template is 
based, you need a regularization parameter. I don’t know for sure but maybe 
this is why the results don’t come in the range of input ratings. I haven’t 
looked at the code in a long while.

If you are asking about the UR it would not take numeric ratings and the scores 
cannot be compared to them.

For many reasons that I have written about before I always warn people about 
using ratings, which have been discontinued as a source of input for Netflix 
(who have removed them from their UX) and many other top recommender users. 
There are many reasons for this, not the least of which is that they are 
ambiguous and don’t directly relate to whether a user might like an item. For 
instance most video sources now use something like the length of time a user 
watches a video, and review sites prefer “like” and “dislike”. The first is 
implicit and the second is quite unambiguous. 


On Dec 18, 2017, at 12:32 AM, GMAIL <babaevka...@gmail.com> wrote:

Is it just me, or does the UR differ strongly from the Recommender?
At least I can't find the method getRatings in the DataSource class, which 
contains all events, in particular "rate", which I needed.

2017-12-18 11:14 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org>:
I didn't solve the problem :(

Now I use the universal recommender

On 18 December 2017 at 09:12, GMAIL <babaevka...@gmail.com> wrote:
And how did you solve this problem? Did you divide prediction score by 2?

2017-12-18 10:40 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org>:
I got the same problem. I still don't know the answer to your question :(

On 17 December 2017 at 14:07, GMAIL <babaevka...@gmail.com> wrote:
I thought that there was a 5 point scale, but if so, why do I get predictions 
of 7, 8, etc.?

P.S. Sorry for my English.

2017-12-17 16:05 GMT+03:00 GMAIL <babaevka...@gmail.com>:
Hi.
I train with

Re: App still returns results after pio app data-delete

2017-12-22 Thread Pat Ferrel
With PIO the model is managed by the user, not PIO. The input is separate and 
can be deleted without affecting the model.

Each Engine handles models its own way but most use the model storage in 
pio-env. So deleting those will get rid of the model. The UR keeps the model in 
ES under the “indexName” and “typeName” in engine.json. So you need to delete 
the index if you want to stop queries from working. The UR maintains one live 
copy of the model and removes old ones after a new one is made live, so there 
will only ever be one model (unless you have changed your indexName often).

On Dec 21, 2017, at 4:58 AM, Noelia Osés Fernández  wrote:

Hi all!

I have executed a pio app data-delete MyApp. The command has outputted the 
following:

[INFO] [HBLEvents] Removing table pio_event:events_4...
[INFO] [App$] Removed Event Store for the app ID: 4
[INFO] [HBLEvents] The table pio_event:events_4 doesn't exist yet. Creating 
now...
[INFO] [App$] Initialized Event Store for the app ID: 4


However, I executed

curl -H "Content-Type: application/json" -d '
{
}' http://localhost:8000/queries.json 

after deleting the data and I still get the same results as before deleting the 
data. Why is this happening?

I expected to get either an error message or an empty result like 
{"itemScores":[]}.

Any help is much appreciated.
Best regards,
Noelia



Re: Recommendation return score more than 5

2017-12-18 Thread Pat Ferrel
There are 2 types of MLlib ALS recommenders last I checked, implicit and 
explicit. Implicit ones you give any arbitrary score, like a 1 for purchase. 
The explicit one you can input ratings and it is expected to predict ratings 
for an individual. But both iirc also have a regularization parameter that 
affects the scoring and is a param so you have to experiment with it using 
cross-validation to see where you get the best results.

There is an old metric used for this type of thing called RMSE 
(root-mean-square error) which, when minimized will give you scores that most 
closely match actual scores in the hold-out set (see wikipedia on 
cross-validation and RMSE). You may have to use explicit ALS and tweak the 
regularization param, to get the lowest RMSE. I doubt anything will guarantee 
them to be in exactly the range of ratings so you’ll then need to pick the 
closest rating.
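A small illustration of the hold-out RMSE idea above, plus snapping raw scores back to the nearest rating. This is plain Python rather than PIO or Spark code, and the numbers are made up for the example:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and held-out ratings."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def nearest_rating(score, lo=1, hi=5):
    """Snap a raw model score to the closest valid rating on a lo..hi scale."""
    return min(hi, max(lo, int(round(score))))

holdout = [4, 3, 5, 2]
model_a = [3.8, 3.2, 4.9, 2.4]   # e.g. trained with one regularization value
model_b = [4.9, 4.1, 5.8, 3.0]   # e.g. trained with another

# The lower-RMSE model matches the hold-out ratings more closely, which is
# how you would cross-validate the regularization parameter.
```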


On Dec 18, 2017, at 10:42 AM, GMAIL  wrote:

That is, the predicted scores that the Recommender returns cannot just be 
multiplied by two, but may be completely wrong? 
I cannot, say, just divide the predictions by 2 and pretend that everything is 
fine?

2017-12-18 21:35 GMT+03:00 Pat Ferrel <p...@occamsmachete.com>:
The UR and the Recommendations Template use very different technology 
underneath. 

In general the scores you get from recommenders are meaningless on their own.
When using ratings as numerical values with a ”Matrix Factorization”
recommender like the ones in MLlib, on which the Recommendations Template is
based, you need a regularization parameter. I don’t know for sure, but maybe
this is why the results don’t come in the range of the input ratings. I haven’t
looked at the code in a long while.
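For background: a matrix-factorization recommender predicts a score as the dot product of a learned user-factor vector and a learned item-factor vector, and nothing in that arithmetic constrains the product to the input rating range. A toy illustration with made-up factors (not real model output):

```python
# Made-up factor vectors from a hypothetical ALS factorization (rank 3).
user_factors = [1.8, 0.9, 2.1]
item_factors = [2.0, 1.5, 1.7]

# The predicted "rating" is just the dot product; nothing clamps it to 1-5.
score = sum(u * i for u, i in zip(user_factors, item_factors))
print(round(score, 2))  # 8.52 -- well outside a 5-point input scale
```

This is one plausible way a model trained on scores below 5 can still predict 7 or 8.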

If you are asking about the UR it would not take numeric ratings and the scores 
cannot be compared to them.

For many reasons that I have written about before, I always warn people about
using ratings, which have been discontinued as a source of input by Netflix
(who have removed them from their UX) and many other top recommender users.
There are many reasons for this, not the least of which is that they are
ambiguous and don’t directly relate to whether a user might like an item. For
instance, most video sources now use something like the length of time a user
watches a video, and review sites prefer “like” and “dislike”. The first is
implicit and the second is quite unambiguous.
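To make the implicit style concrete, here is one toy way to turn watch time into an implicit preference weight. The thresholds and the function are my own illustration, not anything prescribed by a template:

```python
def implicit_weight(watched_seconds, video_length_seconds):
    """Map watch time to an implicit preference weight (toy thresholds)."""
    if video_length_seconds <= 0:
        return 0.0
    fraction = min(watched_seconds / video_length_seconds, 1.0)
    if fraction >= 0.8:   # mostly watched: strong positive signal
        return 1.0
    if fraction >= 0.3:   # sampled it: weak signal
        return 0.5
    return 0.0            # bounced: no signal

print(implicit_weight(550, 600))  # 1.0 -- watched ~92% of the video
```

The point is that the input is derived from observed behavior, not from asking the user for a number.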


On Dec 18, 2017, at 12:32 AM, GMAIL <babaevka...@gmail.com> wrote:

Is it just me, or does the UR differ strongly from the Recommender?
At least I can't find the method getRatings in the DataSource class, which
contains all events, in particular the "rate" events that I needed.

2017-12-18 11:14 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org>:
I didn't solve the problem :(

Now I use the universal recommender

On 18 December 2017 at 09:12, GMAIL <babaevka...@gmail.com> wrote:
And how did you solve this problem? Did you divide prediction score by 2?

2017-12-18 10:40 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org>:
I got the same problem. I still don't know the answer to your question :(

On 17 December 2017 at 14:07, GMAIL <babaevka...@gmail.com> wrote:
I thought that there was a 5 point scale, but if so, why do I get predictions 
of 7, 8, etc.?

P.S. Sorry for my English.

2017-12-17 16:05 GMT+03:00 GMAIL <babaevka...@gmail.com>:
Hi.
I train with Recommendation Engine Template. 
I use data from sample_movielens_data.txt, where all scores are less than 5,
but I get predictions with scores of more than 5.
What does that mean?




-- 
Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior
no...@vicomtech.org
+34 943 30 92 30
Data Intelligence for Energy and Industrial Processes |
Inteligencia de Datos para Energía y Procesos Industriales
Vicomtech: http://www.vicomtech.org/
Legal Notice - Privacy policy: http://www.vicomtech.org/en/proteccion-datos




Re: universal-recommender and 0.12.0-incubating

2017-12-15 Thread Pat Ferrel
yes, the trick is building and putting Mahout in a special place. This step 
will go away in their next release (they keep saying soon).


On Dec 15, 2017, at 8:51 AM, Thibaut Gensollen  wrote:

Yes but you will need the branch 0-7-0-snapshot and carefully follow the 
instructions.

Regards

On 15 Dec 2017 at 17:48, "VI, Tran Tan Phong" <tpvit...@prosodie.com> wrote:
Hi all,

 

Did someone successfully build, train and deploy a « universal-recommender » 
with Prediction IO “0.12.0-incubating” & Elasticsearch 5.5.2?

 

Thanks,

Phong

This message contains information that may be privileged or confidential and is 
the property of the Capgemini Group. It is intended only for the person to whom 
it is addressed. If you are not the intended recipient, you are not authorized 
to read, print, retain, copy, disseminate, distribute, or use this message or 
any part thereof. If you receive this message in error, please notify the 
sender immediately and delete all copies of this message.



Re: Recommended Configuration

2017-12-15 Thread Pat Ferrel
That is enough for a development machine and may work if your data is
relatively small, but for big data, clusters of CPUs with a fair amount of RAM
and storage are required. The telling factor is partly how big your data is,
but also how it combines to form models, which will depend on which recommender
you are using.

We usually build big clusters to analyze the data, then downsize them when we
see how much is needed. If you have small data, < 1M events, you may try a
single machine.


On Dec 15, 2017, at 3:59 AM, GMAIL  wrote:

Hi. 
Could you tell me the recommended configuration for the PredictionIO
Recommender Template to run comfortably?
I read that I need 16GB of RAM, but what about the rest (CPU/storage/GPU?)?

P.S. sorry for my English.



Re: Error: "unable to undeploy"

2017-12-14 Thread Pat Ferrel
kill -9 will always work if you have permission to kill the process. Did you 
launch it as root (not recommended)?


On Dec 14, 2017, at 7:51 AM, Noelia Osés Fernández  wrote:

I use 

jps -lm | grep "onsole deploy" | cut -f 1 -d ' '

to get the IDs of the processes. However, kill -9 pid doesn't kill them :(
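When kill doesn't seem to work, it helps to confirm whether 0.0.0.0:8000 is really still held before retrying pio deploy. A small check in plain Python; the helper is my own and independent of PredictionIO:

```python
import socket

def port_in_use(host="0.0.0.0", port=8000):
    """Return True if some process is already bound to host:port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind((host, port))
        return False          # bind succeeded: nothing is holding the port
    except OSError:
        return True           # EADDRINUSE: a process still holds it
    finally:
        s.close()

print(port_in_use())  # True while a stuck PredictionServer holds 0.0.0.0:8000
```

If this keeps returning True after killing the jps-reported PIDs, some other process (not the PredictionServer) owns the port.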



On 14 December 2017 at 16:43, Александр Лактионов <lokotoc...@gmail.com> wrote:
Yes, pio deploy runs a few processes (usually 2).
Just kill all of them :) Usually all of them die after you kill at least one.
> On 14 Dec 2017, at 18:42, Noelia Osés Fernández wrote:
> 
> 
> Hi Phong,
> 
> I have tried your suggestion but I get a few different hits.
> 
> noelia
> 
> On 14 December 2017 at 16:28, VI, Tran Tan Phong  > wrote:
> Hi Noelia,
> 
>  
> 
> Why don’t you try to identify the process by “ps -ef |grep 8000” then simply 
> kill it?
> 
>  
> 
> Phong
> 
>  
> 
> From: Noelia Osés Fernández [mailto:no...@vicomtech.org]
> Sent: Thursday, 14 December 2017 15:56
> To: u...@predictionio.incubator.apache.org
> Subject: Error: "unable to undeploy"
> 
>  
> 
> Hi,
> 
> The first time after reboot that I train and deploy my PIO app everything 
> works well. However, if I then retrain and deploy again, I get the following 
> error:
> 
> 
> [INFO] [MasterActor] Undeploying any existing engine instance at 
> http://0.0.0.0:8000 
> [ERROR] [MasterActor] Another process might be occupying 0.0.0.0:8000 
> . Unable to undeploy.
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [MasterActor] Bind failed. Retrying... (2 more trial(s))
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [ERROR] [MasterActor] Bind failed. Retrying... (1 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [MasterActor] Bind failed. Retrying... (0 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [MasterActor] Bind failed. Shutting down.
> 
>  
> 
> I thought it was possible to retrain an app that was running and then deploy 
> again.
> 
> Is this not possible?
> 
>  
> 
> How can I kill the running instance?
> 
> I've tried the trick in handmade's integration test but it doesn't work:
> 
>  
> 
> deploy_pid=`jps -lm | grep "onsole deploy" | cut -f 1 -d ' '`
> echo "Killing the deployed test PredictionServer"
> kill "$deploy_pid"
> 
>  
> 
> I still get the same error after doing this.
> 
>  
> 
> Any help is much appreciated.
> 
> Best regards,
> 
> Noelia
> 
> 
> 
> 
> 
> 
> 
> 
> 






Re: New Website

2017-12-13 Thread Pat Ferrel
I guess I’m ok with that since the overall site is such a huge improvement but 
please don’t go back to the old logo for the launch, the color schemes don’t 
match and that will ruin the effect of the new design. If you ask 
startbootstrap I bet they agree.

Ship it, there will be lots of changes later, if only content updates.
+1 

On Dec 13, 2017, at 12:38 PM, Andrew Palumbo  wrote:

I apologize... I've been in back-to-back meetings all week, so things are
hectic. But as far as separating the vote, my thinking is to just ship the site
as-is and then swap out the logo if we get -1s on it.



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Andrew Palumbo 
Date: 12/13/2017 12:35 (GMT-08:00)
To: dev@mahout.apache.org
Subject: RE: New Website

I am +1 on the site absolutely.  I suggest that we seperate the vote  on the 
logo and the site.


Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Pat Ferrel 
Date: 12/13/2017 09:47 (GMT-08:00)
To: dev@mahout.apache.org
Subject: Re: New Website

Due to 8 years of Ruby cruft I can’t get the Jekyll site running without some
major jackhammering. I can’t post a screenshot but here is the proposed logo.

https://github.com/apache/mahout/blob/mahout-1981/website/assets/mahout-logo.svg
 
<https://github.com/apache/mahout/blob/mahout-1981/website/assets/mahout-logo.svg>

I encourage people to look at all of this and be judicious with -1s. This has 
been a lot of work, much of the design volunteered by folks at 
startbootstrap.com. IMO the design is awesome. It will put a good, modern, 
clean face on the new Mahout.

The logo is a simple cube, not my favorite, but I’m not going to -1; my
favorite was the M/infinity symbol. If the logo is meant to be a hypercube
there are simple ways to illustrate it, like some form of this:

https://sarcasticresonance.files.wordpress.com/2017/01/cubes1.png?w=721&zoom=2 
<https://sarcasticresonance.files.wordpress.com/2017/01/cubes1.png?w=721&zoom=2>


On Dec 6, 2017, at 11:27 AM, Pat Ferrel  wrote:

Since you’ve already built it can you share a screen shot? The mockup I saw on 
Slack looked awesome.

Also a logo change is a lot more far reaching so can we have at least a little 
discussion?


On Dec 6, 2017, at 10:18 AM, Andrew Musselman  
wrote:

+1, looks great

On Wed, Dec 6, 2017 at 7:43 AM, Trevor Grant 
wrote:

> Hey all,
> 
> The new website is available by checking out the mahout-1981 branch.
> 
> If anyone interested wants to help do QA on it-
> 
> 
> Follow these instructions
> https://github.com/apache/mahout/blob/mahout-1981/
> website/developers/how-to-update-the-website.md
> 
> The only difference is, until we merge- you need to checkout mahout-1981 to
> see the new site.
> 
> I've been working on getting all of the links working /etc.
> 
> Would like to plan on launching Monday, if no objections. That gives
> everyone a chance have a look.
> 
> Also, even if a typo or broken link slips through- updating the website is
> easier than ever for committers and contributors alike (after we launch the
> new site).  One simply opens a PR against master, and then when merged, the
> site automatically updates!
> 
> Thanks,
> tg
> 






Re: How can I limit RAM usage?

2017-12-13 Thread Pat Ferrel
Spark uses a minimum of 2g per executor with no data or work. There is always 
one executor and one driver with Spark. Welcome to Big Data.

A 4g server is not sufficient to run Spark let alone the rest of the PIO stack. 
16g minimum is my recommendation—and I do mean minimum. Machine learning is not 
an application you can squeeze into small RAM machines.
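A rough budget shows why 4g is not enough even before any data loads. The numbers below are approximations based on the figures mentioned in this thread, not measured values:

```python
# Rough per-process memory budget for a single-machine PIO stack
# (all values are assumptions taken from this thread's figures).
components_gb = {
    "spark_driver": 2.0,      # ~2g minimum per Spark JVM, per the note above
    "spark_executor": 2.0,
    "elasticsearch": 0.5,     # -Xmx512m from jvm.options
    "os_hbase_other": 1.0,    # OS, HBase, event server, page cache, ...
}
total_gb = sum(components_gb.values())
print(total_gb)  # 5.5 -- already beyond a 4 GB server
```

Tuning the JVM flags can trim individual entries, but the sum never fits comfortably in 4 GB.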
 

On Dec 13, 2017, at 4:14 AM, Noelia Osés Fernández  wrote:

I should mention that when I reboot the server, and before I start PIO, 3GB of
RAM are already used, so there is only 1GB of free RAM for PIO.
 

On 13 December 2017 at 13:08, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Hi all,

I have a cloud server with 4GB RAM in which I have installed PIO 
0.12.0-incubating and a few other things. However, I'm having trouble running 
even the smallest examples as it runs out of RAM (plus the swap is also full).

I have set a limit for ES in jvm.options as follows:
-Xms512m
-Xmx512m

Before training and deploying I limit java's RAM usage as follows:

export JAVA_OPTS="-Xmx1g -Xms1g -Dfile.encoding=UTF-8"
pio train  -- --driver-memory 1G --executor-memory 1G
nohup pio deploy > deploy.out &

Are there any more measures I can take to limit RAM usage?

I would like to know I've done absolutely everything possible before paying to 
get more RAM on the server.

Thank you for your help!
Noelia


 





Re: User features to tailor recs in UR queries?

2017-12-12 Thread Pat Ferrel
In our experiments profile attributes have very little benefit, if any. Yes,
you can do that, but you have to use some advanced techniques to choose an LLR
threshold or the model is likely (with default tuning values) to have 100%
density, meaning both genders like the item. This is an effect of the default
tuning, which bypasses threshold calculation because it is not needed for most
data, but gender has only 2 possible values and the default tuning allows 50.
Even if you said to choose only 1, the difference in LLR score may be
insignificant.

If you have a strong gender preference for items in your data it might be worth 
the t-digest & cross-validation tests but again in our experiments there are 
2-3 very helpful secondary indicators and a whole lot of useless ones.

Pick a few things that show a user’s taste, like search terms, browsing 
behavior (detailed product page views), along with your primary indicator and 
start there. Create a baseline cross-validation score with a gold-standard 
dataset. Then add to it to see if the score improves or not. You should A/B 
test even when cross-validation seem to improve.

In several experiments it seems the more indicators you have, the more you see
diminishing returns. We got a 26% lift by using several indicators on the
Rotten Tomatoes movie review recommender, but the last few only gave fractions
of a %. That's 26% over using “like” alone.

https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/ 
<https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/>



On Dec 12, 2017, at 1:29 AM, Noelia Osés Fernández  wrote:


Thank you Pat!

So if I'm understanding correctly, I could set a user profile property as 
follows:

{
   "event" : "$set",
   "entityType" : "user",
   "entityId" : "u1234",
   "properties" : {
  "gender": "female"
   },
   "eventTime" : "2015-10-05T21:02:49.228Z"
}

Although this is not recommended. Right?

On 5 December 2017 at 17:38, Pat Ferrel <p...@occamsmachete.com> wrote:
The user’s possible indicators of taste are encoded in the usage data. Gender
and other “profile” type data can be encoded as (user-id, gender, gender-id),
but this is used as a secondary indicator, not as a filter. Only item
properties are used as filters, for some very practical reasons. For one
thing, items are what you are recommending, so you would have to establish
some relationship between items and the gender of buyers. The UR does this
with user data in secondary indicators but does not filter by these because
they are calculated properties, not ones assigned by humans, like “in-stock”
or “language”.

Location is an easy secondary indicator but needs to be encoded with “areas”,
not lat/lon, so something like (user-id, location-of-purchase,
country-code+postal-code). This would be triggered when a primary event
happens, such as a purchase. This way location is accounted for in making
recommendations without your having to do anything but feed in the data.

Lat/lon proximity filters are not implemented but possible.
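As an illustration, an encoded location indicator could be sent as a normal event whose target "item" is the area code, so its correlation with the primary event can be tested. The field names follow the standard PredictionIO event schema, but the event name and the "US-98101" country-code+postal-code encoding are my own example, not a required format:

```python
import json

# Hypothetical secondary-indicator event: (user-id, location-of-purchase,
# country-code+postal-code), sent alongside the primary purchase event.
location_event = {
    "event": "location-of-purchase",
    "entityType": "user",
    "entityId": "u1234",
    "targetEntityType": "item",
    "targetEntityId": "US-98101",
    "eventTime": "2017-12-05T17:38:00.000Z",
}
payload = json.dumps(location_event)
print(payload)
```

The payload would be POSTed to the EventServer like any other event; the recommender then decides whether the indicator correlates with purchases.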

One thing to note is that fields used to filter or boost are very different
from user taste indicators. For one thing, they are never tested for
correlation with the primary event (purchase, read, watch, …) so they can be
very dangerous to use unwisely. They are best used for business rules, like
only show “in-stock” items, or in this video carousel show only videos of the
“mystery” genre. But if you use user profile data to filter recommendations
you can distort what is returned and get bad results. We once had a client
that wanted to do this against our warnings, filtering by location, gender,
and several other things known about the user, and got 0 lift in sales. We
convinced them to try without the “business rules” and got good lift in sales.
User taste indicators are best left to the correlation test by inputting them
as user indicator data, except where you purposely want to reduce the
recommendations to a subset for a business reason.

Put more simply: business rules can kill the value of a recommender; let it
figure out whether an indicator matters. And always remember that indicators
apply to users; filters and boosts apply to items and known properties of
items. It may seem like genre is both a user taste indicator and an item
property, but if you input it in 2 ways it can be used in 2 ways: 1) to make
better recommendations, 2) in business rules. They are stored and used in
completely different ways.



On Dec 5, 2017, at 7:59 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Hi all,

I have seen how to use item properties in queries to tailor the recommendations 
returned by the UR.

But I was wondering whether it is possible to use user characteristics to do 
the same. For example, I want to query for recs from the UR but only taking
into account the history of users that are female (or only using the history
of users in the same county). Is this possible to do?


Re: User features to tailor recs in UR queries?

2017-12-05 Thread Pat Ferrel
The user’s possible indicators of taste are encoded in the usage data. Gender
and other “profile” type data can be encoded as (user-id, gender, gender-id),
but this is used as a secondary indicator, not as a filter. Only item
properties are used as filters, for some very practical reasons. For one
thing, items are what you are recommending, so you would have to establish
some relationship between items and the gender of buyers. The UR does this
with user data in secondary indicators but does not filter by these because
they are calculated properties, not ones assigned by humans, like “in-stock”
or “language”.

Location is an easy secondary indicator but needs to be encoded with “areas”,
not lat/lon, so something like (user-id, location-of-purchase,
country-code+postal-code). This would be triggered when a primary event
happens, such as a purchase. This way location is accounted for in making
recommendations without your having to do anything but feed in the data.

Lat/lon proximity filters are not implemented but possible.

One thing to note is that fields used to filter or boost are very different
from user taste indicators. For one thing, they are never tested for
correlation with the primary event (purchase, read, watch, …) so they can be
very dangerous to use unwisely. They are best used for business rules, like
only show “in-stock” items, or in this video carousel show only videos of the
“mystery” genre. But if you use user profile data to filter recommendations
you can distort what is returned and get bad results. We once had a client
that wanted to do this against our warnings, filtering by location, gender,
and several other things known about the user, and got 0 lift in sales. We
convinced them to try without the “business rules” and got good lift in sales.
User taste indicators are best left to the correlation test by inputting them
as user indicator data, except where you purposely want to reduce the
recommendations to a subset for a business reason.

Put more simply: business rules can kill the value of a recommender; let it
figure out whether an indicator matters. And always remember that indicators
apply to users; filters and boosts apply to items and known properties of
items. It may seem like genre is both a user taste indicator and an item
property, but if you input it in 2 ways it can be used in 2 ways: 1) to make
better recommendations, 2) in business rules. They are stored and used in
completely different ways.



On Dec 5, 2017, at 7:59 AM, Noelia Osés Fernández  wrote:

Hi all,

I have seen how to use item properties in queries to tailor the recommendations 
returned by the UR.

But I was wondering whether it is possible to use user characteristics to do 
the same. For example, I want to query for recs from the UR but only taking 
into account the history of users that are female (or only using the history of 
users in the same county). Is this possible to do?

I've been reading the UR docs but couldn't find info about this.

Thank you very much!

Best regards,
Noelia




Re: Data lost from HBase to DataSource

2017-11-29 Thread Pat Ferrel
1596 is how many events were accepted by the EventServer, look at the exported 
format and compare with the ones you imported. There must be a formatting error 
or an error when importing (did you check responses for each event import?)

Looking below I see you are importing JPEG??? This is almost always a bad idea.
Image data is usually kept in a filesystem like HDFS and a reference kept in
the DB; there are too many serialization questions to do otherwise, in my
experience. If your Engine requires this you are asking for the kind of trouble
you are seeing.


On Nov 28, 2017, at 7:16 PM, Huang, Weiguang  wrote:

Hi Pat,
 
Here is the result when we tried out your suggestion.
 
We checked the data from the Hbase, and the count of the records is exactly the 
same as we imported into the Hbase, that is 6500.
2017-11-29 10:42:19 INFO  DAGScheduler:54 - Job 0 finished: count at 
ImageDataFromHBaseChecker.scala:27, took 12.016679 s
Number of Records found : 6500
 
We exported data from Pio and checked, but got only 1596 – see at the bottom of 
the below screen record.
$ ls -al
total 412212
drwxr-xr-x  2 root root  4096 Nov 29 02:48 .
drwxr-xr-x 23 root root  4096 Nov 29 02:48 ..
-rw-r--r--  1 root root 8 Nov 29 02:48 ._SUCCESS.crc
-rw-r--r--  1 root root817976 Nov 29 02:48 .part-0.crc
-rw-r--r--  1 root root817976 Nov 29 02:48 .part-1.crc
-rw-r--r--  1 root root817976 Nov 29 02:48 .part-2.crc
-rw-r--r--  1 root root817976 Nov 29 02:48 .part-3.crc
-rw-r--r--  1 root root 0 Nov 29 02:48 _SUCCESS
-rw-r--r--  1 root root 104699844 Nov 29 02:48 part-0
-rw-r--r--  1 root root 104699877 Nov 29 02:48 part-1
-rw-r--r--  1 root root 104699843 Nov 29 02:48 part-2
-rw-r--r--  1 root root 104699863 Nov 29 02:48 part-3
$ wc -l part-0
399 part-0
$ wc -l part-1
399 part-1
$ wc -l part-2
399 part-2
$ wc -l part-3
399 part-3
That is 399 * 4 = 1596
 
Is this data loss caused by a schema change, bad data contents, or some other
reason? We'd appreciate your thoughts.
 
Thanks,
Weiguang
From: Pat Ferrel [mailto:p...@occamsmachete.com]
Sent: Wednesday, November 29, 2017 10:16 AM
To: user@predictionio.apache.org
Cc: u...@predictionio.incubator.apache.org
Subject: Re: Data lost from HBase to DataSource
 
Try my suggestion with export and see if the number of events looks correct. I 
am suggesting that you may not be counting what you think you are using HBase.
 
 
On Nov 28, 2017, at 5:53 PM, Huang, Weiguang <weiguang.hu...@intel.com> wrote:
 
Hi Pat,
 
Thanks for your advice.  However, we are not using HBase directly. We use pio 
to import data into HBase by below command:
pio import --appid 7 --input hdfs://[host]:9000/pio/  
applicationName /recordFile.json
Could things go wrong here or somewhere else?
 
Thanks,
Weiguang
From: Pat Ferrel [mailto:p...@occamsmachete.com]
Sent: Tuesday, November 28, 2017 11:54 PM
To: user@predictionio.apache.org
Cc: u...@predictionio.incubator.apache.org
Subject: Re: Data lost from HBase to DataSource
 
It is dangerous to use HBase directly because the schema may change at any
time. Export the data as JSON and examine it there. To see how many events are
in the stream you can just export them and use bash to count lines (wc -l).
Each line is a JSON event. Or import the data as a DataFrame in Spark and use
Spark SQL.
 
There is no published contract about how events are stored in HBase.
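A concrete sketch of that counting workflow (the app id and paths are placeholders, and the `printf` lines fake what `pio export` would have written):

```shell
# In a real setup, first export (placeholder app id and directory):
#   pio export --appid 7 --output /tmp/pio-export
# pio export writes part-* files containing one JSON event per line.
mkdir -p /tmp/pio-export
printf '%s\n' '{"event":"rate","entityId":"1"}' \
              '{"event":"rate","entityId":"2"}' > /tmp/pio-export/part-00000
printf '%s\n' '{"event":"rate","entityId":"3"}' > /tmp/pio-export/part-00001

# One line per event, so the event count is just:
cat /tmp/pio-export/part-* | wc -l
```

Here that counts 3 events. This number, not a raw HBase row or region count, is the one to compare against what training reports.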
 
 
On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <sachinkam...@gmail.com> wrote:
 
We are also facing the exact same issue. We have confirmed 1.5 million records 
in HBase. However, I see only 19k records being fed for training 
(eventsRDD.count()).

With Regards,
 
 Sachin
⚜KTBFFH⚜
 
On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <weiguang.hu...@intel.com> wrote:
Hi guys,
 
I have encoded some JPEG images in JSON and imported them to HBase, which shows 6500 
records. When I read those data in DataSource with pio, however, only some 1500 
records were fed into PIO.
I use PEventStore.find(appName, entityType, eventNames), and all the records 
have the same entityType and eventNames.
 
Any idea what could go wrong? The encoded string from a JPEG is very long, 
hundreds of thousands of characters; could this be a reason for the data loss?
 
Thank you for looking into my question.
 
Best,
Weiguang






Re: Error with a Cluster : ConnectionRefused

2017-11-28 Thread Pat Ferrel
We (ActionML) do contract and consulting based on PIO and I can assure you from many installations that it works quite well with clustered HDFS, HBase, Spark, and Elasticsearch. For a truly scalable setup I’d recommend clusters for Spark, HBase+HDFS, and ES 5.x, so 3 different ones.

If you are using the ActionML setup instructions you must be using the Universal Recommender? This was broken by the latest release of PIO 0.12.0 but we have an RC that will be released with the next Mahout release (should be next week). You can use it now by following some interim build instructions (see the UR-0.7.0-SNAPSHOT readme here: https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT). BTW the page you reference is for PIO 0.11.0 and is being updated as I write.

If you are using the UR you do not want an HDFS storage backend, which is being checked by `pio status`. Can you share your pio-env? Even if you are not using the UR my theory is that something is set up wrong in pio-env, because using clustered HDFS for a store is quite typical and well tested.

On Nov 28, 2017, at 8:49 AM, Thibaut Gensollen - Choose wrote:

Hi Guys,

I tried to set up a PredictionIO cluster, working with Elasticsearch and Hadoop clusters. The fact is, everything is working well (HBase, ES or Hadoop/Spark), but as soon as I tried `pio-start-all` I am getting these errors:

aml@master:~$ pio status
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at /home/aml
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at 
/opt/spark/spark-1.6.3-bin-hadoop2.6
[INFO] [Management$] Apache Spark 1.6.3 detected (meets minimum requirement of 
1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[ERROR] [Storage$] Error initializing storage client for source HDFS.
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://master.c.choose-ninja-01.internal:9000/models, expected: file:
///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.fs.RawLocalFileSystem.setWorkingDirectory(RawLocalFileSystem.java:547)
at org.apache.hadoop.fs.FilterFileSystem.setWorkingDirectory(FilterFileSystem.java:280)
at org.apache.predictionio.data.storage.hdfs.StorageClient.<init>(StorageClient.scala:33)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:252)
at org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:283)
at org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
at org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:244)
at org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:315)
at org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:300)
at org.apache.predictionio.data.storage.Storage$.getModelDataModels(Storage.scala:442)
at org.apache.predictionio.data.storage.Storage$.verifyAllDataObjects(Storage.scala:381)
at org.apache.predictionio.tools.commands.Management$.status(Management.scala:156)
at org.apache.predictionio.tools.console.Pio$.status(Pio.scala:155)
at org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:721)
at org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:656)
at scala.Option.map(Option.scala:146)
at org.apache.predictionio.tools.console.Console$.main(Console.scala:656)
at org.apache.predictionio.tools.console.Console.main(Console.scala)

[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.
Data source HDFS was not properly initialized. 
(org.apache.predictionio.data.storage.StorageClientException)
Dumping configuration of initialized storage backend sources.
Please make sure they are correct.
Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: TYPE -> 
elasticsearch, HOM
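For comparison, a hedged sketch of the pio-env.sh section that `pio status` is validating here; the variable names follow PIO's storage-source convention, but the values (and whether you want an HDFS model store at all, per the note above about the UR) are examples only:

```shell
# Hypothetical pio-env.sh fragment: model data kept on clustered HDFS.
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://master.c.choose-ninja-01.internal:9000/models

# Assumption: the "Wrong FS ... expected: file:///" error above usually means
# Hadoop's client config was not on the classpath, so the hdfs:// path fell
# back to the local filesystem; HADOOP_CONF_DIR should point at the cluster's
# configuration directory (example path).
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
```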






[jira] [Assigned] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-27 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel reassigned MAHOUT-2023:
--

Assignee: Trevor Grant  (was: Pat Ferrel)

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-27 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267877#comment-16267877
 ] 

Pat Ferrel commented on MAHOUT-2023:


This is a big issue. It shows up when you run a Spark CLI but also seems to 
affect GPU bindings written in Scala, disabling both. The fix is somewhere in 
the build system afaict.

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Prepping Release

2017-11-27 Thread Pat Ferrel
https://issues.apache.org/jira/browse/MAHOUT-2023 is the only blocker I see. 
It’s a big one since it makes drivers and GPU bindings not work in clusters (I 
think). But the fix is probably easy.


On Nov 27, 2017, at 8:06 AM, Jim Jagielski  wrote:

Looks good to me! Thx!

> On Nov 26, 2017, at 11:59 AM, Trevor Grant  wrote:
> 
> Hey all-
> 
> Making another run at prepping a 0.13.1 Release.
> 
> Please see
> https://issues.apache.org/jira/projects/MAHOUT/versions/12339149
> 
> If anyone has any other issues they think need to be addressed before
> 0.13.1 please make sure the "Affects Version" On the JIRA ticket is
> correctly set, and list "type" as a blocker.
> 
> I think most of the issues on that list are more or less taken care of,
> assuming no other blockers sneak up, will be calling "code freeze" mid
> week.
> 
> Thanks!
> tg




[jira] [Resolved] (MAHOUT-2020) Maven repo structure malformed

2017-11-27 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-2020.

Resolution: Fixed

Trevor found a script in Spark that seems to fix this when used during a build. 
Marking as fixed but we need to document this for source builds.

> Maven repo structure malformed
> --
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo is built with scala 2.10 always in the parent pom's 
> {scala.compat.version} even when you only ask for Scala 2.11, this leads to 
> the 2.11 jars never being found. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: universal recommender evaluation

2017-11-24 Thread Pat Ferrel
For the UR we have scripts that do this instead of the Evaluation APIs, which 
are pretty limited and do not do what we want, which is hyper-parameter search. 
This requires changing some params that require models to be re-created, while 
other tests vary query params. All of these are only possible with scripts that 
control the whole system from the outside.


On Nov 24, 2017, at 9:42 AM, Pat Ferrel  wrote:

Yes, this is what we do. We split by date into 10-90 or 20-80. The metric we 
use is MAP@k for precision, and as a proxy for recall we look at the % of people 
in the test set that get recs (turn off popularity backfill, or everyone will 
get some kind of recs, if only popular ones). The more independent events you 
have in the data the larger your recall number will be. Expect small precision 
numbers; they are an average, and larger is better. Do not use it to compare 
different algorithms; only A/B tests work for that, no matter what the academics 
do. Use your cross-validation scores to compare tunings. Start with the default 
for everything as your baseline and tune from there.
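For reference, the MAP@k metric mentioned above, in its usual form (notation mine, not from the thread): with $U$ the test users and $R_u$ the held-out items for user $u$,

```latex
\mathrm{AP@}k(u) = \frac{1}{\min(k, |R_u|)} \sum_{i=1}^{k} P_u(i)\,\mathrm{rel}_u(i),
\qquad
\mathrm{MAP@}k = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP@}k(u)
```

where $P_u(i)$ is the precision of the top $i$ recommendations for $u$ and $\mathrm{rel}_u(i)$ is 1 when the item at rank $i$ is in $R_u$, else 0.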


On Nov 24, 2017, at 12:54 AM, Andy Rao  wrote:

Hi, 

I have successfully trained our rec model using the Universal Recommender, but I do 
not know how to evaluate the trained model. 

The first idea that comes to mind is to split our dataset into train and test 
datasets, and then evaluate using recall metrics. But I'm not sure whether this is 
a good idea or not.

Any help or suggestion is much appreciated.
Hongyao  





 



Re: Log-likelihood based correlation test?

2017-11-23 Thread Pat Ferrel
Use the default. Tuning with a threshold is only for atypical data and unless 
you have a harness for cross-validation you would not know if you were making 
things worse or better. We have our own tools for this but have never had the 
need for threshold tuning. 

Yes, llrDownsampled(PtP) is the “model”, each doc put into Elasticsearch is a 
sparse representation of a row from it, along with those from PtV, PtC,… Each 
gets a “field” in the doc.


On Nov 22, 2017, at 6:16 AM, Noelia Osés Fernández  wrote:

Thanks Pat!

How can I tune the threshold?

And when you say "compare to each item in the model", do you mean each row in 
PtP?

On 21 November 2017 at 19:56, Pat Ferrel <p...@occamsmachete.com> wrote:
No, PtP non-zero elements have LLR calculated. The highest scores in the row are 
kept, or ones above some threshold; the rest are removed as “noise”. These are 
put into the Elasticsearch model without scores. 

Elasticsearch compares the similarity of the user history to each item in the 
model to find the KNN similar ones. This uses OKAPI BM25 from Lucene, which has 
several benefits over pure cosines (it actually consists of adjustments to 
cosine) and we also use norms. With ES 5 we should see quality improvements due 
to this. 
https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html
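For reference, the textbook Okapi BM25 score (this is the general formula, not lifted from the UR or Lucene source):

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
\frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

where $f(t,D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, and $k_1$, $b$ are tuning constants (the common Lucene defaults are $k_1 = 1.2$, $b = 0.75$).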



On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP 
are removed by the LLR (set to zero, to be precise). But the elements that 
survive are calculated by matrix multiplication. The final PtP is put into 
EleasticSearc and when we query for user recommendations ES uses KNN to find 
the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix 
multiplication, and I'm assuming that the P matrix only has 0s and 1s to 
indicate which items have been purchased by which user, then the elements of 
PtP are either 0 or greater than or equal to 1. However, the scores I get are 
below 1.

So is the KNN using cosine similarity as a metric to calculate the closest 
neighbours? And is the results of this cosine similarity metric what is 
returned as a 'score'?

If it is, when it is greater than 1, is this because the different cosine 
similarities are added together i.e. PtP, PtL... ?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating 
the LLR score for every non-zero value. We then keep the top K, or use a 
threshold to decide whether to keep a value or not (both are supported in the UR). LLR 
is a metric for seeing how likely it is that 2 events in a large group are correlated. 
Therefore LLR is only used to remove weak data from the model.
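For the curious, the LLR test here is Dunning's G² statistic over the 2x2 cooccurrence table of two events (notation mine: $k_{11}$ = times both occurred, $k_{12}$ and $k_{21}$ = one without the other, $k_{22}$ = neither):

```latex
G^2 = 2 \sum_{i,j \in \{1,2\}} k_{ij} \,\ln \frac{k_{ij}\, N}{R_i\, C_j},
\qquad
R_i = \sum_j k_{ij}, \quad C_j = \sum_i k_{ij}, \quad N = \sum_{i,j} k_{ij}
```

A large $G^2$ means the observed cooccurrence count is unlikely under independence, so the item pair is kept as an indicator; a small one is the "weak data" that gets filtered out.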

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query, finding the items that 
most closely match it. Since PtP will have items in rows and each row will have 
correlating items, this “search” method works quite well to find items that had 
very similar items purchased with them, as are in the user’s history.

=== that is the simple explanation 


Item-based recs take the model items (correlated items by the LLR test) as the 
query and the results are the most similar items—the items with most similar 
correlating items.

The model is items in rows and items in columns if you are only using one 
event: PtP. If you think it through, it has all purchased items as the row 
key and other items purchased along with the row key. LLR filters out the 
weakly correlating non-zero values (0 means no evidence of correlation anyway). 
If we didn’t do this it would be purely a “Cooccurrence” recommender, one of 
the first useful ones. But filtering based on cooccurrence strength (PtP values 
without LLR applied to them) produces much worse results than using LLR to 
filter for the most highly correlated cooccurrences. You get a similar effect with 
Matrix Factorization, but you can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be 
applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC 
(purchase, category-preferences). We did an experiment using Mean Average 
Precision for the UR using video “Likes” vs “Likes” and “Dislikes” so LtL vs. 
LtL and LtD scraped from rottentomatoes.com reviews and 

Re: Total number of events in predictionio are showing less then the actual events

2017-11-23 Thread Pat Ferrel
My vague recollection is that HBase may mark things for removal but wait for 
certain operations before they are compacted. If this is the case I’m sure 
there is a way to get the correct count so this may be a question for the HBase 
list.


On Nov 23, 2017, at 1:51 AM, Abhimanyu Nagrath wrote:

Done the same as you have mentioned but the problem still persists.




Regards,
Abhimanyu

On Thu, Nov 23, 2017 at 2:53 PM, Александр Лактионов <lokotoc...@gmail.com> wrote:
Hi Abhimanyu,

try setting TTL for rows in your hbase table
it can be set in hbase shell:
alter 'pio_event:events_?', NAME => 'e', TTL => 
and then do the following in the shell:
major_compact 'pio_event:events_?'

You can configure auto major compact: it will delete all the rows that are 
older than TTL

> On Nov 23, 2017, at 12:19, Abhimanyu Nagrath wrote:
> 
> Hi,
> 
> I am stuck at this point .How to identify the problem?
> 
> 
> Regards,
> Abhimanyu
> 
> On Mon, Nov 20, 2017 at 11:08 AM, Abhimanyu Nagrath 
> <abhimanyunagr...@gmail.com> wrote:
> Hi , I am new to predictionIO V 0.12.0 (elasticsearch - 5.2.1 , hbase - 1.2.6 
> , spark - 2.6.0) Hardware (244 GB RAM and Core - 32) . I have uploaded near 
> about 1 million events(each containing 30k features) . while uploading I can 
> see the size of hbase disk increasing and after all the events got uploaded 
> the size of hbase disk is 567GB. In order to verify I ran the following 
> commands 
> 
>  - pio-shell --with-spark --conf spark.network.timeout=1000 
> --driver-memory 30G --executor-memory 21G --num-executors 7 --executor-cores 
> 3 --conf spark.driver.maxResultSize=4g --conf 
> spark.executor.heartbeatInterval=1000
>  - import org.apache.predictionio.data.store.PEventStore
>  - val eventsRDD = PEventStore.find(appName="test")(sc)
>  - val c = eventsRDD.count() 
> it shows event counts as 18944
> 
> After that from the script through which I uploaded the events, I randomly 
> queried with there events Id and I was getting that event.
> 
> I don't know how to make sure that all the events uploaded by me are there in 
> the app. Any help is appreciated.
> 
> 
> Regards,
> Abhimanyu
> 





Re: "pio app delete" breaks my PIO installation

2017-11-21 Thread Pat Ferrel
I’ve seen this happen on my dev machine (a mac laptop) when I use 
`pio-start-all` which I never use in production.

My test for everything running includes:
pio status: this only tests the metadata service and the main stores 
(Elasticsearch and HBase in my case). It by no means examines all services 
needed.
hdfs dfs -ls: this tests that the HDFS service is working.
curl -i -X GET http://localhost:7070: this tests whether the PIO EventServer is 
listening on the right port.

Often the curl command hangs, so I restart the EventServer by doing 
pio-stop-all, pio-start-all. This actually causes the UR integration test to 
hang if the EventServer is not listening correctly. 
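The checks above can be bundled into a small script; the commands are the ones listed, while the `check` helper is mine (a sketch, assuming default ports and that pio, hdfs, and curl are on the PATH):

```shell
# check NAME CMD... : run CMD quietly and report pass/fail without aborting.
check() {
  name="$1"; shift
  if "$@" > /dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
  fi
}

check "pio metadata + stores" pio status
check "hdfs"                  hdfs dfs -ls /
check "event server on 7070"  curl -sf -m 5 http://localhost:7070
```

A FAIL on the last line corresponds to the hanging curl worked around with `pio-stop-all` / `pio-start-all`.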

Not sure if this is what Donald is talking about because your storage is 
working if pio status works. But restarting the EventService does the trick 
usually for me.


On Nov 21, 2017, at 11:44 AM, Donald Szeto  wrote:

Hey Noelia,

What event storage backend are you using? Is the backend storage stuck? This 
could happen very often with a local, single-node HBase installation.

Regards,
Donald

On Mon, Nov 13, 2017 at 7:51 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:
I forgot to mention that pio status reports my system is all ready to go :(

but it isn't true. I can't import the data.

On 13 November 2017 at 16:47, Noelia Osés Fernández <no...@vicomtech.org> wrote:
Hi,

It has happened several times already that after I execute:

pio app delete appname

my PIO installation breaks. Does anybody else have this problem?

Particularly, this time I get the following error during data import:

Traceback (most recent call last):
  File "data/import_eventserver.py", line 63, in 
import_events(client, args.file)
  File "data/import_eventserver.py", line 33, in import_events
properties= { "rating" : float(data[2]) }
  File "/usr/local/lib/python2.7/dist-packages/predictionio/__init__.py", line 
247, in create_event
event_time).get_response()
  File "/usr/local/lib/python2.7/dist-packages/predictionio/connection.py", 
line 111, in get_response
self._response = self.rfunc(tmp_response)
  File "/usr/local/lib/python2.7/dist-packages/predictionio/__init__.py", line 
118, in _acreate_resp
(response.error, response.request))
predictionio.NotCreatedError: Exception happened: timed out for request POST 
/events.json?accessKey=XQewLhG4RfqP1zAMH9y3E5c4wd0_vFYRYgQIMX3gxluzNlTI6N_M16z_CjjV9zAY
 {'event': 'rate', 'eventTime': '2017-11-13T15:32:24.506+', 'entityType': 
'user', 'targetEntityId': '31', 'properties': {'rating': 2.5}, 'entityId': '1', 
'targetEntityType': 'item'} 
/events.json?accessKey=XQewLhG4RfqP1zAMH9y3E5c4wd0_vFYRYgQIMX3gxluzNlTI6N_M16z_CjjV9zAY?event=rate&eventTime=2017-11-13T15%3A32%3A24.506%2B&entityType=user&targetEntityId=31&properties=%7B%27rating%27%3A+2.5%7D&entityId=1&targetEntityType=item


This app was working this morning. I deleted it, then created it again and now 
I have this error and can't make it work.

Furthermore, when I execute
curl -i -X GET http://localhost:7070 
the terminal just hangs there, it doesn't print any output nor error messages.

Any help is much appreciated,
Noelia










-- 
 

Noelia Osés Fernández, PhD
Senior Researcher |
Investigadora Senior

no...@vicomtech.org 
+[34] 943 30 92 30 
Data Intelligence for Energy and
Industrial Processes | Inteligencia
de Datos para Energía y Procesos
Industriales

   
  




Re: Log-likelihood based correlation test?

2017-11-21 Thread Pat Ferrel
No, PtP non-zero elements have LLR calculated. The highest scores in the row are 
kept, or ones above some threshold; the rest are removed as “noise”. These are 
put into the Elasticsearch model without scores. 

Elasticsearch compares the similarity of the user history to each item in the 
model to find the KNN similar ones. This uses OKAPI BM25 from Lucene, which has 
several benefits over pure cosines (it actually consists of adjustments to 
cosine) and we also use norms. With ES 5 we should see quality improvements due 
to this. 
https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html



On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández  wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP 
are removed by the LLR (set to zero, to be precise). But the elements that 
survive are calculated by matrix multiplication. The final PtP is put into 
EleasticSearc and when we query for user recommendations ES uses KNN to find 
the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix 
multiplication, and I'm assuming that the P matrix only has 0s and 1s to 
indicate which items have been purchased by which user, then the elements of 
PtP are either 0 or greater than or equal to 1. However, the scores I get are 
below 1.

So is the KNN using cosine similarity as a metric to calculate the closest 
neighbours? And is the results of this cosine similarity metric what is 
returned as a 'score'?

If it is, when it is greater than 1, is this because the different cosine 
similarities are added together i.e. PtP, PtL... ?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating 
the LLR score for every non-zero value. We then keep the top K, or use a 
threshold to decide whether to keep a value or not (both are supported in the UR). LLR 
is a metric for seeing how likely it is that 2 events in a large group are correlated. 
Therefore LLR is only used to remove weak data from the model.

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query, finding the items that 
most closely match it. Since PtP will have items in rows and each row will have 
correlating items, this “search” method works quite well to find items that had 
very similar items purchased with them, as are in the user’s history.

=== that is the simple explanation 


Item-based recs take the model items (correlated items by the LLR test) as the 
query and the results are the most similar items—the items with most similar 
correlating items.

The model is items in rows and items in columns if you are only using one 
event: PtP. If you think it through, it has all purchased items as the row 
key and other items purchased along with the row key. LLR filters out the 
weakly correlating non-zero values (0 means no evidence of correlation anyway). 
If we didn’t do this it would be purely a “Cooccurrence” recommender, one of 
the first useful ones. But filtering based on cooccurrence strength (PtP values 
without LLR applied to them) produces much worse results than using LLR to 
filter for the most highly correlated cooccurrences. You get a similar effect with 
Matrix Factorization, but you can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be 
applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC 
(purchase, category-preferences). We did an experiment using Mean Average 
Precision for the UR using video “Likes” vs “Likes” and “Dislikes” so LtL vs. 
LtL and LtD scraped from rottentomatoes.com reviews and got a 20% lift in the 
MAP@k score by including data for “Dislikes”. 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

So the benefit and use of LLR is to filter weak data from the model and allow 
us to see if dislikes, and other events, correlate with likes. Adding this type 
of data, that is usually thrown away is one the the most powerful reasons to 
use the algorithm—BTW the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query 
is that is it fast, taking the user’s realtime events into the query but also 
because it is is trivial to add all sorts or business r

[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259971#comment-16259971
 ] 

Pat Ferrel commented on MAHOUT-2023:


Yep, the mahout...dependency-reduced.jar excludes anything with 
{{scala.compat.version}} in the name

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258255#comment-16258255
 ] 

Pat Ferrel edited comment on MAHOUT-2023 at 11/20/17 10:40 PM:
---

OK, now that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifacts exist in remote repos for all Scala versions 
and they are being found by the Mahout build.
* the ids for the artifacts etc. are correct as per the above.
* I checked all the tagged versions of Mahout back to 0.12.0. Not sure when the 
drivers stopped working, but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list, I will assume that up until the last build changes the drivers 
worked.
* The vienna-cl and Java-to-C bindings are in the assembly POM, so these 
classes are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script, where changes were 
small and not relevant.
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below, and it does not; it only had guava, Apache 
Commons, and fastutil. It is supposed to have:

{code:xml}
<!-- The archive stripped the XML tags from this snippet; tag names here are
     inferred from the Maven assembly dependencySet schema and may differ
     slightly from the actual POM. -->
<dependencySet>
  <unpack>true</unpack>
  <unpackOptions>
    <excludes>
      <exclude>META-INF/LICENSE</exclude>
    </excludes>
  </unpackOptions>
  <scope>runtime</scope>
  <outputDirectory>/</outputDirectory>
  <useTransitiveFiltering>true</useTransitiveFiltering>
  <includes>
    <include>com.google.guava:guava</include>
    <include>com.github.scopt:scopt_${scala.compat.version}</include>
    <include>com.tdunning:t-digest</include>
    <include>org.apache.commons:commons-math3</include>
    <include>it.unimi.dsi:fastutil</include>
    <include>org.apache.mahout:mahout-native-viennacl_${scala.compat.version}</include>
    <include>org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}</include>
    <include>org.bytedeco:javacpp</include>
  </includes>
</dependencySet>
{code}

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java driver code, since those other 
libs in the assembly are probably all Hadoop or Spark Executor code, not needed 
in the Mahout driver. This is likely to have been a side effect of the build 
refactoring.

[~rawkintrevo_apache] does the "dependency-reduced.jar" which contains scopt 
get its scala.compat.version resolved? It seems like the jar is missing 
anything with scala.compat.version, but this may be a red herring.




was (Author: pferrel):
OK, now that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifacts exist in remote repos for all Scala versions 
and they are being found by the Mahout build.
* the ids for the artifacts etc. are correct as per the above.
* I checked all the tagged versions of Mahout back to 0.12.0. Not sure when the 
drivers stopped working, but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list, I will assume that up until the last build changes the drivers 
worked.
* The vienna-cl and Java-to-C bindings are in the assembly POM, so these 
classes are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script, where changes were 
small and not relevant.
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below, and it does not; it only had guava, Apache 
Commons, and fastutil. It is supposed to have:

{code:xml}
<!-- The archive stripped the XML tags from this snippet; tag names here are
     inferred from the Maven assembly dependencySet schema and may differ
     slightly from the actual POM. -->
<dependencySet>
  <unpack>true</unpack>
  <unpackOptions>
    <excludes>
      <exclude>META-INF/LICENSE</exclude>
    </excludes>
  </unpackOptions>
  <scope>runtime</scope>
  <outputDirectory>/</outputDirectory>
  <useTransitiveFiltering>true</useTransitiveFiltering>
  <includes>
    <include>com.google.guava:guava</include>
    <include>com.github.scopt:scopt_${scala.compat.version}</include>
    <include>com.tdunning:t-digest</include>
    <include>org.apache.commons:commons-math3</include>
    <include>it.unimi.dsi:fastutil</include>
    <include>org.apache.mahout:mahout-native-viennacl_${scala.compat.version}</include>
    <include>org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}</include>
    <include>org.bytedeco:javacpp</include>
  </includes>
</dependencySet>
{code}

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java driver code, since those other 
libs in the assembly are probably all Hadoop or Spark Executor code, not needed 
in the Mahout driver. This is likely to have been a side effect of the build 
refactoring.

[~rawkintrevo_apache] does the "dependency-reduced.jar" which contains scopt 
get its scala.compat.version resolved? It seems like the jar is missing 
anything with scala.compat.version, but this may be a red herring.



> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 

Re: Log-likelihood based correlation test?

2017-11-20 Thread Pat Ferrel
Yes, this will show the model. But if you do this a lot, there are tools like 
Restlet that you can plug into Chrome. They will allow you to build queries of 
all sorts. For instance 
GET http://localhost:9200/urindex/_search?pretty 

will show the item rows of the UR model put into the index for the integration 
test data. The UI is a bit obtuse but you can scroll down in the right pane 
expanding bits of JSON as you go to see this:

"hits": {
  "total": 7,
  "max_score": 1,
  "hits": [
    {
      "_index": "urindex_1511033890025",
      "_type": "items",
      "_id": "Nexus",
      "_score": 1,
      "_source": {
        "defaultRank": 4,
        "expires": "2017-11-04T19:01:23.655-07:00",
        "countries": ["United States", "Canada"],
        "id": "Nexus",
        "date": "2017-11-02T19:01:23.655-07:00",
        "category-pref": ["tablets"],
        "categories": ["Tablets", "Electronics", "Google"],
        "available": "2017-10-31T19:01:23.655-07:00",
        "purchase": [],
        "popRank": 2,
        "view": ["Tablets"]
      }
    },

As you can see no purchased items survived the correlation test, one survived 
the view and category-pref correlation tests. The other fields are item 
properties set using $set events and are used with business rules.

With something like this tool you can even take the query logged by the 
deployed PIO server and send it yourself, to see how the query is constructed 
and what the results are (same as you get from the SDK, I’ll wager :-)



On Nov 20, 2017, at 7:07 AM, Daniel Gabrieli  wrote:

There is a REST client for Elasticsearch and bindings in many popular 
languages, but to get started quickly I found these commands helpful:

List Indices:

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Get some documents from an index:

curl -XGET 'localhost:9200/<index>/_search?q=*&pretty'

Then look at the "_source" in the document to see what values are associated 
with the document.

More info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source

this might also be helpful to work through a single specific query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html





On Mon, Nov 20, 2017 at 9:49 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
Thanks Daniel!

And excuse my ignorance but... how do you inspect the ES index?

On 20 November 2017 at 15:29, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
There is this cli tool and article with more information that does produce 
scores:

https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html

But I don't know of any commands that return diagnostics about LLR from the PIO 
framework / UR engine.  That would be a nice feature if it doesn't exist.  The 
way I've gotten some insight into what the model is doing when using PIO / UR 
is by inspecting the Elasticsearch index that gets created, because it has the 
"significant" values populated in the documents (though not the actual LLR 
scores).

On Mon, Nov 20, 2017 at 7:22 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
This thread is very enlightening, thank you very much!

Is there a way I can see what the P, PtP, and PtL matrices of an app are? In 
the handmade case, for example?

Are there any pio calls I can use to get these?

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com> wrote:
Mahout builds the model by doing matrix multiplication (PtP), then calculating 
the LLR score for every non-zero value. We then keep the top K or use a 
threshold to decide whether to keep it or not (both are supported in the UR). 
LLR is a metric for seeing how likely it is that 2 events in a large group are 
correlated. Therefore LLR is only used to remove weak data from the model.

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query, finding the items 
that most closely match it. Since PtP will have items in rows and each row will 
have correlating items, this “search” method works quite well to find items 
that had very similar items purchased with them, as in the user’s history.

=== that is the simple explanation 


Item-based recs ta

[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-18 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258258#comment-16258258
 ] 

Pat Ferrel commented on MAHOUT-2023:


Whoa, that is a big clue I think. Everything without scala.compat.version in 
the name is included in the file 
mahout-spark_2.10-0.13.1-SNAPSHOT-dependency-reduced.jar (or whatever is 
generated for the Scala version), but none of the classes that use 
scala.compat.version to resolve the classname are.

Big clue but not sure where it leads [~rawkintrevo_apache] Any idea where to 
look from here?

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-18 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258255#comment-16258255
 ] 

Pat Ferrel edited comment on MAHOUT-2023 at 11/18/17 11:32 PM:
---

OK, now that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifacts exist in remote repos for all Scala versions 
and they are being found by the Mahout build.
* the ids for the artifacts etc. are correct as per the above.
* I checked all the tagged versions of Mahout back to 0.12.0. Not sure when the 
drivers stopped working, but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list, I will assume that up until the last build changes the drivers 
worked.
* The vienna-cl and Java-to-C bindings are in the assembly POM, so these 
classes are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script, where changes were 
small and not relevant.
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below, and it does not; it only had guava, Apache 
Commons, and fastutil. It is supposed to have:

{code:xml}
<!-- The archive stripped the XML tags from this snippet; tag names here are
     inferred from the Maven assembly dependencySet schema and may differ
     slightly from the actual POM. -->
<dependencySet>
  <unpack>true</unpack>
  <unpackOptions>
    <excludes>
      <exclude>META-INF/LICENSE</exclude>
    </excludes>
  </unpackOptions>
  <scope>runtime</scope>
  <outputDirectory>/</outputDirectory>
  <useTransitiveFiltering>true</useTransitiveFiltering>
  <includes>
    <include>com.google.guava:guava</include>
    <include>com.github.scopt:scopt_${scala.compat.version}</include>
    <include>com.tdunning:t-digest</include>
    <include>org.apache.commons:commons-math3</include>
    <include>it.unimi.dsi:fastutil</include>
    <include>org.apache.mahout:mahout-native-viennacl_${scala.compat.version}</include>
    <include>org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}</include>
    <include>org.bytedeco:javacpp</include>
  </includes>
</dependencySet>
{code}

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java driver code, since those other 
libs in the assembly are probably all Hadoop or Spark Executor code, not needed 
in the Mahout driver. This is likely to have been a side effect of the build 
refactoring.

[~rawkintrevo_apache] does the "dependency-reduced.jar" which contains scopt 
get its scala.compat.version resolved? It seems like the jar is missing 
anything with scala.compat.version, but this may be a red herring.




was (Author: pferrel):
OK, now that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifacts exist in remote repos for all Scala versions 
and they are being found by the Mahout build.
* the ids for the artifacts etc. are correct as per the above.
* I checked all the tagged versions of Mahout back to 0.12.0. Not sure when the 
drivers stopped working, but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list, I will assume that up until the last build changes the drivers 
worked.
* The vienna-cl and Java-to-C bindings are in the assembly POM, so these 
classes are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script, where changes were 
small and not relevant.
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below, and it does not; it only had guava, Apache 
Commons, and fastutil. It is supposed to have:

{code:xml}
<!-- The archive stripped the XML tags from this snippet; tag names here are
     inferred from the Maven assembly dependencySet schema and may differ
     slightly from the actual POM. -->
<dependencySet>
  <unpack>true</unpack>
  <unpackOptions>
    <excludes>
      <exclude>META-INF/LICENSE</exclude>
    </excludes>
  </unpackOptions>
  <scope>runtime</scope>
  <outputDirectory>/</outputDirectory>
  <useTransitiveFiltering>true</useTransitiveFiltering>
  <includes>
    <include>com.google.guava:guava</include>
    <include>com.github.scopt:scopt_${scala.compat.version}</include>
    <include>com.tdunning:t-digest</include>
    <include>org.apache.commons:commons-math3</include>
    <include>it.unimi.dsi:fastutil</include>
    <include>org.apache.mahout:mahout-native-viennacl_${scala.compat.version}</include>
    <include>org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}</include>
    <include>org.bytedeco:javacpp</include>
  </includes>
</dependencySet>
{code}

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java driver code, since those other 
libs in the assembly are probably all Hadoop or Spark Executor code, not needed 
in the Mahout driver. This is likely to have been a side effect of the build 
refactoring.

[~rawkintrevo_apache] does the "dependency-reduced.jar" which contains scopt 
get its scala.compat.version resolved? This doesn't seem to be the problem but 
is a question nonetheless.



> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilari

[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-18 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258255#comment-16258255
 ] 

Pat Ferrel commented on MAHOUT-2023:


OK, now that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifacts exist in remote repos for all Scala versions 
and they are being found by the Mahout build.
* the ids for the artifacts etc. are correct as per the above.
* I checked all the tagged versions of Mahout back to 0.12.0. Not sure when the 
drivers stopped working, but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list, I will assume that up until the last build changes the drivers 
worked.
* The vienna-cl and Java-to-C bindings are in the assembly POM, so these 
classes are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script, where changes were 
small and not relevant.
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below, and it does not; it only had guava, Apache 
Commons, and fastutil. It is supposed to have:

{code:xml}
<!-- The archive stripped the XML tags from this snippet; tag names here are
     inferred from the Maven assembly dependencySet schema and may differ
     slightly from the actual POM. -->
<dependencySet>
  <unpack>true</unpack>
  <unpackOptions>
    <excludes>
      <exclude>META-INF/LICENSE</exclude>
    </excludes>
  </unpackOptions>
  <scope>runtime</scope>
  <outputDirectory>/</outputDirectory>
  <useTransitiveFiltering>true</useTransitiveFiltering>
  <includes>
    <include>com.google.guava:guava</include>
    <include>com.github.scopt:scopt_${scala.compat.version}</include>
    <include>com.tdunning:t-digest</include>
    <include>org.apache.commons:commons-math3</include>
    <include>it.unimi.dsi:fastutil</include>
    <include>org.apache.mahout:mahout-native-viennacl_${scala.compat.version}</include>
    <include>org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}</include>
    <include>org.bytedeco:javacpp</include>
  </includes>
</dependencySet>
{code}

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java driver code, since those other 
libs in the assembly are probably all Hadoop or Spark Executor code, not needed 
in the Mahout driver. This is likely to have been a side effect of the build 
refactoring.

[~rawkintrevo_apache] does the "dependency-reduced.jar" which contains scopt 
get its scala.compat.version resolved? This doesn't seem to be the problem but 
is a question nonetheless.



> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Log-likelihood based correlation test?

2017-11-17 Thread Pat Ferrel
Mahout builds the model by doing matrix multiplication (PtP), then calculating 
the LLR score for every non-zero value. We then keep the top K or use a 
threshold to decide whether to keep it or not (both are supported in the UR). 
LLR is a metric for seeing how likely it is that 2 events in a large group are 
correlated. Therefore LLR is only used to remove weak data from the model.

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query, finding the items 
that most closely match it. Since PtP will have items in rows and each row will 
have correlating items, this “search” method works quite well to find items 
that had very similar items purchased with them, as in the user’s history.

=== that is the simple explanation 


Item-based recs take the model items (correlated items by the LLR test) as the 
query and the results are the most similar items—the items with most similar 
correlating items.

The model is items in rows and items in columns if you are only using one 
event: PtP. If you think it through, it has all purchased items as the row 
keys and the other items purchased along with the row key as the columns. LLR 
filters out the weakly correlating non-zero values (0 means no evidence of 
correlation anyway). If we didn’t do this it would be a pure “Cooccurrence” 
recommender, one of the first useful ones. But filtering based on cooccurrence 
strength (PtP values without LLR applied to them) produces much worse results 
than using LLR to filter for the most highly correlated cooccurrences. You get 
a similar effect with Matrix Factorization, but there you can only use one type 
of event for various reasons.
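A toy illustration (with made-up data) of what PtP looks like: given a binary 
user-by-item purchase matrix P, PtP counts, for each pair of items, how many 
users purchased both, and the 2x2 table for the LLR test falls straight out of 
those counts. The matrix and helper below are illustrative, not the UR's 
actual code:

```python
import numpy as np

# Toy binary purchase matrix: rows = users, columns = items (illustrative data)
P = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
])

# Item-by-item cooccurrence counts; the diagonal is each item's popularity
PtP = P.T @ P
n_users = P.shape[0]

def contingency(i, j):
    """2x2 contingency table for items i and j, as fed to the LLR test."""
    k11 = PtP[i, j]                   # users who bought both
    k12 = PtP[i, i] - k11             # bought i but not j
    k21 = PtP[j, j] - k11             # bought j but not i
    k22 = n_users - k11 - k12 - k21   # bought neither
    return k11, k12, k21, k22

print(PtP)
print(contingency(0, 1))
```

Only the pairs whose contingency table scores high under LLR keep their entry 
in the model; the rest are zeroed out.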

Since LLR is a probabilistic metric that only looks at counts, it can be 
applied equally well to PtV (purchase, view), PtS (purchase, search terms), or 
PtC (purchase, category-preferences). We ran an experiment using Mean Average 
Precision for the UR with video “Likes” vs. “Likes” and “Dislikes” (so LtL 
alone vs. LtL plus LtD) scraped from rottentomatoes.com reviews, and got a 20% 
lift in the MAP@k score by including data for “Dislikes”: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/ 


So the benefit and use of LLR is to filter weak data from the model and allow 
us to see if dislikes, and other events, correlate with likes. Adding this type 
of data, which is usually thrown away, is one of the most powerful reasons to 
use the algorithm. BTW the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query 
is that it is fast, taking the user’s realtime events into the query, but also 
because it is trivial to add all sorts of business rules: like give me recs 
based on user events but only ones from a certain category, or give me recs but 
only ones tagged as “in-stock”. In fact the business rules can have inclusion 
rules, exclusion rules, and be mixed with ANDs and ORs.
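A hedged sketch of what such a query might look like as an Elasticsearch 
request body: the user's recent history goes into `should` clauses (more 
matches means a higher similarity score), while business rules become `filter` 
and `must_not` clauses. The field names, item ids, and category values below 
are illustrative, not the UR's exact query format:

```python
import json

# User history drives the KNN similarity; business rules constrain results.
# All field names and values here are illustrative.
query = {
    "size": 10,
    "query": {
        "bool": {
            # KNN part: indicator fields that overlap the user's history
            "should": [
                {"terms": {"purchase": ["nexus", "surface"]}},
                {"terms": {"view": ["tablets", "phones"]}},
            ],
            # Inclusion rule: only items from a given category
            "filter": [
                {"terms": {"categories": ["Electronics"]}},
            ],
            # Exclusion rule: hide items tagged out-of-stock
            "must_not": [
                {"term": {"stock-status": "out-of-stock"}},
            ],
        }
    },
}

print(json.dumps(query, indent=2))
```

A body like this could be POSTed to the index's _search endpoint with curl, as 
in Daniel's examples earlier in the thread.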

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: 
https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT 
 
Instructions in the readme and notice it is in the 0.7.0-SNAPSHOT branch.


On Nov 17, 2017, at 7:59 AM, Andrew Troemner  wrote:

I'll echo Dan here. He and I went through the raw Mahout libraries called by 
the Universal Recommender, and while Noelia's description is accurate for an 
intermediate step, the indexing via ElasticSearch generates some separate 
relevancy scores based on their Lucene indexing scheme. The raw LLR scores are 
used in building this process, but the final scores served up by the API's 
should be post-processed, and cannot be used to reconstruct the raw LLR's (to 
my understanding).

There are also some additional steps including down-sampling, which scrubs out 
very rare combinations (which otherwise would have very high LLR's for a single 
observation), which partially corrects for the statistical problem of multiple 
detection. But the underlying logic is per Ted Dunning's research and 
summarized by Noelia, and is a solid way to approach interaction effects for 
tens of thousands of items and including secondary indicators (like 
demographics, or implicit preferences).
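One common form of the down-sampling Andrew mentions is simply capping how many 
interactions per user make it into the model: a single hyperactive user would 
otherwise dominate the counts, and very rare combinations can produce 
spuriously high LLRs from a single observation. A minimal sketch, with an 
illustrative cap value and function name (not the UR's actual implementation or 
defaults):

```python
from collections import defaultdict

def cap_user_history(events, max_per_user=500):
    """Keep at most max_per_user (user, item) events per user, in arrival
    order. The cap value is illustrative only."""
    kept_per_user = defaultdict(int)
    kept = []
    for user, item in events:
        if kept_per_user[user] < max_per_user:
            kept_per_user[user] += 1
            kept.append((user, item))
    return kept

events = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a")]
print(cap_user_history(events, max_per_user=2))
```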

ANDREW TROEMNER
Associate Principal Data Scientist | salesforce.com 
Office: 317.832.4404
Mobile: 317.531.0216




 
On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
Maybe someone can correct me if I am wrong but in the code I believe 
Elasticsearch is used instead of "res

Re: Error in getting Total Events in a predictionIo App

2017-11-14 Thread Pat Ferrel
You should use pio 0.12.0 if you need Elasticsearch 5.x


On Nov 14, 2017, at 6:39 AM, Abhimanyu Nagrath  
wrote:

Hi, I am new to PredictionIO, using version 0.11-incubating (Spark 2.6.1, 
HBase 1.2.6, Elasticsearch 5.2.1). I started the PredictionIO server with 
./pio-start-all and checked pio status; these are working fine. Then I created 
an app 'testApp' and imported some events into that PredictionIO app. Now, in 
order to verify the count of imported events, I ran the following commands:

 1. pio-shell --with-spark
 2. import org.apache.predictionio.data.store.PEventStore
 3. val eventsRDD = PEventStore.find(appName="testApp")(sc)

I got the error:

ERROR Storage$: Error initializing storage client for source ELASTICSEARCH
java.lang.ClassNotFoundException: elasticsearch.StorageClient
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at 
org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:228)
at 
org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:254)
at 
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
at 
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
at 
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at 
org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:215)
at 
org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:284)
at 
org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:269)
at 
org.apache.predictionio.data.storage.Storage$.getMetaDataApps(Storage.scala:387)
at 
org.apache.predictionio.data.store.Common$.appsDb$lzycompute(Common.scala:27)
at org.apache.predictionio.data.store.Common$.appsDb(Common.scala:27)
at 
org.apache.predictionio.data.store.Common$.appNameToId(Common.scala:32)
at 
org.apache.predictionio.data.store.PEventStore$.find(PEventStore.scala:71)
at 
$line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
at $line19.$read$$iwC$$iwC$$iwC$$iwC.(:39)
at $line19.$read$$iwC$$iwC$$iwC.(:41)
at $line19.$read$$iwC$$iwC.(:43)
at $line19.$read$$iwC.(:45)
at $line19.$read.(:47)
at $line19.$read$.(:51)
at $line19.$read$.()
at $line19.$eval$.(:7)
at $line19.$eval$.()
at $line19.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org 
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org 
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at o

Re: Which template for predicting ratings?

2017-11-13 Thread Pat Ferrel
What I was saying is the UR can use ratings, but not predict them. Use MLlib 
ALS recommenders if you want to predict them for all items.


On Nov 13, 2017, at 9:32 AM, Pat Ferrel  wrote:

What we did in the article I attached is assume 1-2 is dislike, and 4-5 is like.

These are treated as indicators and will produce a score from the recommender, 
but those scores do not relate to the 1-5 ratings.

If you need to predict what the user would score an item MLlib ALS templates 
will do it.
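The rating-to-indicator conversion from the article can be sketched in a few 
lines; the thresholds (1-2 dislike, 4-5 like) are the ones from the article, 
and dropping the ambiguous 3s is one reasonable choice, not the only one:

```python
def rating_to_indicator(rating):
    """Map a 1-5 star rating to a categorical indicator event."""
    if rating <= 2:
        return "dislike"
    if rating >= 4:
        return "like"
    return None  # 3 is ambiguous; drop it from the training data

print([rating_to_indicator(r) for r in [1, 2, 3, 4, 5]])
```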



On Nov 13, 2017, at 2:42 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Hi Pat,

I truly appreciate your advice.

However, what to do with a client that is adamant that they want to display the 
predicted ratings in the form of 1 to 5-stars? That's my case right now. 

I will pose a more concrete question. Is there any template for which the 
scores predicted by the algorithm are in the same range as the ratings in the 
training set?

Thank you very much for your help!
Noelia

On 10 November 2017 at 17:57, Pat Ferrel <p...@occamsmachete.com> wrote:
Any of the Spark MLlib ALS recommenders in the PIO template gallery support 
ratings.

However, I must warn that ratings are not very good for recommendations, and 
none of the big players use ratings anymore; Netflix doesn’t even display them. 
The reason is that your 2 may be my 3 or 4, and people rate different 
categories differently. For instance, Netflix found Comedies were rated lower 
than Independent films. There have been many solutions proposed and tried, but 
none have proven very helpful.

There is another, more fundamental problem: why would you want to recommend the 
highest rated item? What do you buy on Amazon or watch on Netflix? Are they 
only your highest rated items? Research has shown that they are not. There was 
a whole misguided movement around ratings that affected academic papers and 
cross-validation metrics and that has fairly well been discredited. It all came 
from the Netflix prize, which used both. Netflix has since led the way in 
dropping ratings as they saw the things I have mentioned.

What do you do? Categorical indicators work best (like, dislike), or implicit 
indicators (buy) that are unambiguous. If a person buys something, they like 
it; if they rate it 3, do they like it? I buy many 3-rated items on Amazon if I 
need them.

My advice is drop ratings and use thumbs up or down. These are unambiguous, and 
the thumbs down can be used in some cases to predict thumbs up: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
This uses data from a public web site to show significant lift by using “like” 
and “dislike” in recommendations. This used the Universal Recommender.


On Nov 10, 2017, at 5:02 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:


Hi all,

I'm new to PredictionIO so I apologise if this question is silly.

I have an application in which users are rating different items in a scale of 1 
to 5 stars. I want to recommend items to a new user and give her the predicted 
rating in number of stars. Which template should I use to do this? Note that I 
need the predicted rating to be in the same range of 1 to 5 stars.

Is it possible to do this with the ecommerce recommendation engine?

Thank you very much for your help!
Noelia









-- 
Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior
no...@vicomtech.org
+[34] 943 30 92 30
Data Intelligence for Energy and Industrial Processes |
Inteligencia de Datos para Energía y Procesos Industriales

Legal Notice - Privacy policy: http://www.vicomtech.org/en/proteccion-datos



Re: Which template for predicting ratings?

2017-11-13 Thread Pat Ferrel
What we did in the article I attached is assume 1-2 is dislike, and 4-5 is like.

These are treated as indicators and will produce a score from the recommender, 
but those scores do not relate to the 1-5 range.
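As a concrete sketch of the 1-2/4-5 mapping described here (illustrative only; the helper and event-tuple shape are hypothetical, not part of any template):

```python
# Sketch: turn 1-5 star ratings into categorical indicators as described
# above: 1-2 becomes "dislike", 4-5 becomes "like", and 3 is dropped as
# ambiguous. The (user, event, item) tuple shape is illustrative.
def rating_to_indicator(user, item, stars):
    if stars <= 2:
        return (user, "dislike", item)
    if stars >= 4:
        return (user, "like", item)
    return None  # a 3 carries little signal either way

ratings = [("u1", "i1", 5), ("u1", "i2", 2), ("u2", "i1", 3)]
events = [e for e in (rating_to_indicator(u, i, s) for u, i, s in ratings) if e]
```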

If you need to predict what the user would score an item, the MLlib ALS 
templates will do it.
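For context, ALS-style matrix factorization predicts a rating as the dot product of learned user and item factor vectors, which is why its scores come out in (roughly) the training scale. A toy sketch with invented factor values:

```python
# Toy illustration of how an ALS-style model produces a predicted rating:
# the dot product of a user factor vector and an item factor vector learned
# during training. The factor values here are made up for illustration.
def predict_rating(user_factors, item_factors):
    return sum(u * i for u, i in zip(user_factors, item_factors))

user = [1.2, 0.8]  # hypothetical learned user factors
item = [2.0, 1.5]  # hypothetical learned item factors
pred = predict_rating(user, item)  # 1.2*2.0 + 0.8*1.5 = 3.6
```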



On Nov 13, 2017, at 2:42 AM, Noelia Osés Fernández  wrote:

Hi Pat,

I truly appreciate your advice.

However, what do I do with a client that is adamant that they want to display the 
predicted ratings in the form of 1-to-5 stars? That's my case right now. 

I will pose a more concrete question. Is there any template for which the 
scores predicted by the algorithm are in the same range as the ratings in the 
training set?

Thank you very much for your help!
Noelia

On 10 November 2017 at 17:57, Pat Ferrel <p...@occamsmachete.com> wrote:
Any of the Spark MLlib ALS recommenders in the PIO template gallery support 
ratings.

However, I must warn that ratings are not very good for recommendations, and none 
of the big players use ratings anymore; Netflix doesn’t even display them. The 
reason is that your 2 may be my 3 or 4, and people rate different 
categories differently. For instance, Netflix found Comedies were rated lower 
than Independent films. There have been many solutions proposed and tried, but 
none have proven very helpful.

There is another, more fundamental problem: why would you want to recommend the 
highest rated item? What do you buy on Amazon or watch on Netflix? Are they 
only your highest rated items? Research has shown that they are not. There was 
a whole misguided movement around ratings that affected academic papers and 
cross-validation metrics that has been fairly well discredited. It all came 
from the Netflix prize, which used both. Netflix has since led the way in 
dropping ratings as they saw the things I have mentioned.

What do you do? Categorical indicators work best (like, dislike) or implicit 
indicators (buy) that are unambiguous. If a person buys something, they like 
it; if they rate it 3, do they like it? I buy many 3-rated items on Amazon if I 
need them. 

My advice is drop ratings and use thumbs up or down. These are unambiguous and 
the thumbs down can be used in some cases to predict thumbs up: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/ 
<https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/>
 This uses data from a public web site to show significant lift by using “like” 
and “dislike” in recommendations. This used the Universal Recommender.


On Nov 10, 2017, at 5:02 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:


Hi all,

I'm new to PredictionIO so I apologise if this question is silly.

I have an application in which users are rating different items on a scale of 1 
to 5 stars. I want to recommend items to a new user and give her the predicted 
rating in number of stars. Which template should I use to do this? Note that I 
need the predicted rating to be in the same range of 1 to 5 stars.

Is it possible to do this with the ecommerce recommendation engine?

Thank you very much for your help!
Noelia











Re: Does PIO support [ --master yarn --deploy-mode cluster ]?

2017-11-13 Thread Pat Ferrel
yarn-cluster mode is supported but extra config needs to be set so the driver 
can be run on a remote machine.

I have seen instructions for this on the PIO mailing list.



On Nov 12, 2017, at 7:30 PM, wei li  wrote:

Hi Pat
Thanks a lot for your advice.

We are using [yarn-client] mode now; UR trains well and we can monitor the 
output log at the pio application console.

I tried to find a way to use [yarn-cluster] mode, to submit a train job and 
shut down the pio application (in docker) immediately 
(monitoring the job process at the Hadoop cluster website instead of the pio 
application console).
But then I met errors like this: file path [file://xxx.jar] cannot be found.

Maybe [yarn-cluster] mode is not supported now. I will keep looking for the 
explanation.


On Saturday, November 11, 2017 at 12:41:33 AM UTC+8, pat wrote:
Yes, PIO supports Yarn, but you may have more luck getting an explanation on the 
PredictionIO mailing list.
Subscribe here: http://predictionio.incubator.apache.org/support/ 


On Nov 9, 2017, at 11:33 PM, wei li wrote:

Hi, all

Any one have any idea about this?

-- 
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-use...@googlegroups.com .
To post to this group, send email to action...@googlegroups.com .
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/af5c6748-ae7f-4c05-bbc5-6dcf6c1a480a%40googlegroups.com
 
.
For more options, visit https://groups.google.com/d/optout 
.





Re: "LLR with time"

2017-11-12 Thread Pat Ferrel
ing content being
> overwhelmed. You might also be able to spot content that has intense
> interest from a sub-population as opposed to diffuse interest from a mass
> population.
> 
> You can also use novelty and trending boosts for content in the normal
> recommendation engine. I have avoided this in the past because I felt it
> was better to have specialized pages for what's new and hot rather than
> because I had data saying it was bad to do. I have put a very weak
> recommendation effect on the what's hot pages so that people tend to see
> trending material that they like. That doesn't help on what's new pages for
> obvious reasons unless you use a touch of second order recommendation.
> 
> 
> 
> 
> 
> On Sat, Nov 11, 2017 at 11:00 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
> 
>> Well the greece thing was just an example for a thing you don't know
>> upfront - it could be any of the modeled feature on the cross recommender
>> input side (user segment, country, city, previous buys), some
> subpopulation
>> getting active, so the current approach, probably with sampling that
>> favours newer events, will be the best here. Luckily a sampling strategy
> is
>> a big topic anyway since we're trying to go for the near real time way -
>> pat, you talked about it some while ago on this list and i still have to
>> look at the flink talk from trevor grant but I'm really eager to attack
>> this after years of batch :)
>> 
>> Thanks for your thoughts, I am happy I can rule something out given the
>> domain (poisson llr). Luckily the domain I'm working on is event
>> recommendations, so there is a natural deterministic item expiry (as
>> compared to christmas like stuff).
>> 
>> Again,
>> thanks!
>> 
>> 
>> On Sat, Nov 11, 2017 at 7:00 PM, Ted Dunning 
>> wrote:
>> 
>>> Inline.
>>> 
>>> On Sat, Nov 11, 2017 at 6:31 PM, Pat Ferrel 
>> wrote:
>>> 
>>>> If Mahout were to use http://bit.ly/poisson-llr it would tend to
> favor
>>>> new events in calculating the LLR score for later use in the
> threshold
>>> for
>>>> whether a co- or cross-occurrence is incorporated in the model.
>>> 
>>> 
>>> I don't think that this would actually help for most recommendation
>>> purposes.
>>> 
>>> It might help to determine that some item or other has broken out of
>>> historical rates. Thus, we might have "hotness" as a detected feature
>> that
>>> could be used as a boost at recommendation time. We might also have
> "not
>>> hotness" as a negative boost feature.
>>> 
>>> Since we have a pretty good handle on the "other" counts, I don't think
>>> that the Poisson test would help much with the cooccurrence stuff
> itself.
>>> 
>>> Changing the sampling rule could make a difference to temporality and
>> would
>>> be more like what Johannes is asking about.
>>> 
>>> 
>>>> But it doesn’t relate to popularity as I think Ted is saying.
>>>> 
>>>> Are you looking for 1) personal recommendations biased by hotness in
>>>> Greece or 2) things hot in Greece?
>>>> 
>>>> 1) create a secondary indicator for “watched in some locale” the
>> local-id
>>>> uses a country-code+postal-code maybe but not lat-lon. Something that
>>>> includes a good number of people/events. Then the query would be 
>> user-id,
>>>> and user-locale. This would yield personal recs preferred in the
> user’s
>>>> locale. Athens-west-side in this case.
>>>> 
>>> 
>>> And this works in the current regime. Simply add location tags to the
>> user
>>> histories and do cooccurrence against content. Locations will pop out
> as
>>> indicators for some content and not for others. Then when somebody
>> appears
>>> in some location, their tags will retrieve localized content.
>>> 
>>> For localization based on strict geography, say for restaurant search,
> we
>>> can just add business rules based on geo-search. A very large bank
>> customer
>>> of ours does that, for instance.
>>> 
>>> 
>>>> 2) split the data into locales and do the hot calc I mention. The
> query
>>>> would have no user-id since it is not personalized but would yield
> “hot
>>> in
>>>> Greece”
>>>> 
>>> 
>>> I think that this is 

Re: "LLR with time"

2017-11-11 Thread Pat Ferrel
If Mahout were to use http://bit.ly/poisson-llr it would tend to favor new 
events in calculating the LLR score for later use in the threshold for whether 
a co- or cross-occurrence is incorporated in the model. This is very 
interesting and would be useful in cases where you can keep a lot of data or 
where recent data is far more important, like news. This is the time-aware 
G-test you are referencing, as I understand it.
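For reference, the (non-time-aware) LLR score discussed here is computed from a 2x2 table of co-occurrence counts. A minimal sketch using the standard unnormalized-entropy formulation (the same shape as Mahout's LogLikelihood; an illustration, not the PIO/Mahout code itself):

```python
from math import log

def xlogx(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized Shannon entropy over raw counts
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """k11: co-occurrences of A and B; k12/k21: one without the other;
    k22: neither. Large scores indicate a non-random association."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

A co- or cross-occurrence would then be kept in the model only when `llr(...)` clears the chosen threshold.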

But it doesn’t relate to popularity as I think Ted is saying.

Are you looking for 1) personal recommendations biased by hotness in Greece or 
2) things hot in Greece?

1) create a secondary indicator for “watched in some locale”; the locale-id uses 
a country-code+postal-code maybe, but not lat-lon. Something that includes a 
good number of people/events. Then the query would be user-id and user-locale. 
This would yield personal recs preferred in the user’s locale, Athens-west-side 
in this case.
2) split the data into locales and do the hot calc I mention. The query would 
have no user-id since it is not personalized, but would yield “hot in Greece”.

Ted’s “Christmas video” tag is what I was calling a business rule and can be 
added to either of the above techniques.
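Option 1 above can be sketched as follows; the event name and query fields are hypothetical, chosen only to show the shape of a secondary locale indicator (check the real PIO event and UR query formats against their docs):

```python
# Sketch of option 1: a coarse locale id (country + postal code, not lat-lon)
# used as the target of a secondary "watched-in-locale" indicator event.
def locale_id(country_code, postal_code):
    return f"{country_code}-{postal_code}"

def watched_in_locale_event(user, country_code, postal_code):
    # hypothetical event shape; pairs each user with the locale they watch from
    return {"entityId": user, "event": "watched-in-locale",
            "targetEntityId": locale_id(country_code, postal_code)}

# the query then carries both the user id and the user's locale
query = {"user": "u1", "watched-in-locale": [locale_id("GR", "10431")]}
```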

On Nov 11, 2017, at 4:01 AM, Ted Dunning  wrote:

So ... there are a few different threads here.

1) LLR but with time. Quite possible, but not really what Johannes is
talking about, I think. See http://bit.ly/poisson-llr for a quick
discussion.

2) time varying recommendation. As Johannes notes, this can make use of
windowed counts. The problem is that rarely accessed items should probably
have longer windows so that we use longer term trends when we have less
data.

The good news here is that this some part of this is nearly already in the
code. The trick is that the down-sampling used in the system can be adapted
to favor recent events over older ones. That means that if the meaning of
something changes over time, the system will catch on. Likewise, if
something appears out of nowhere, it will quickly train up. This handles
the popular in Greece right now problem.
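The recency-favoring down-sampling described here can be sketched as an age-dependent keep probability (the half-life value and function name are invented for illustration):

```python
import random

def keep_event(age_days, half_life_days=30.0, rng=random.random):
    """Keep recent events almost always; sample old ones away with a
    probability that halves every `half_life_days`."""
    p = 0.5 ** (age_days / half_life_days)
    return rng() < p

# an event from today is essentially always kept; a year-old one rarely is
recent_kept = keep_event(0.0, rng=lambda: 0.99)
old_kept = keep_event(365.0, rng=lambda: 0.5)
```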

But this isn't the whole story of changing recommendations. Another problem
that we commonly face is what I call the Christmas music issue. The idea is
that there are lots of recommendations for music that are highly seasonal.
Thus, Bing Crosby fans want to hear White Christmas
<https://www.youtube.com/watch?v=P8Ozdqzjigg> until the day after Christmas
at which point this becomes a really bad recommendation. To some degree,
this can be partially dealt with by using temporal tags as indicators, but
that doesn't really allow a recommendation to be completely shut down.

The only way that I have seen to deal with this in the past is with a
manually designed kill switch. As much as possible, we would tag the
obviously seasonal content and then add a filter to kill or downgrade that
content the moment it went out of fashion.
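The manual kill switch described here could be as simple as a tag-plus-date-window filter applied at recommendation time; everything in this sketch (tag names, dates, item ids) is invented to show the idea:

```python
from datetime import date

# hypothetical seasonal windows: a tagged item is only recommendable inside
# its window; outside it, the kill switch drops it from the results
SEASONAL_WINDOWS = {"christmas": (date(2017, 11, 15), date(2017, 12, 26))}

def in_season(item_tags, today):
    for tag in item_tags:
        window = SEASONAL_WINDOWS.get(tag)
        if window and not (window[0] <= today <= window[1]):
            return False  # out of season: kill the recommendation
    return True

recs = [("white_christmas", {"christmas"}), ("jazz_album", set())]
filtered = [item for item, tags in recs if in_season(tags, date(2017, 12, 27))]
```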



On Sat, Nov 11, 2017 at 9:43 AM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Pat, thanks for your help, especially the insights on how you handle the
> system in production and the tips for multiple acyclic buckets.
> Doing the combination of signals when querying sounds okay but as you say,
> it's always hard to find the right boosts without setting up some LTR
> system. If there were a way to use the hotness when calculating the
> indicators for subpopulations it would be great, especially for a cross
> recommender.
> 
> e.g. people in greece _now_ are viewing this show/product  whatever
> 
> And here the popularity of the recommended item in this subpopulation could
> be overrseen when just looking at the overall derivatives of activity.
> 
> Maybe one could do multiple G-Tests using sliding windows
> * itemA&itemB  vs population (classic)
> * itemA&itemB(t) vs itemA&itemB(t-1)
> ..
> 
> and derive multiple indicators per item to be indexed.
> 
> But this all relies on discretizing time into buckets and not looking at
> the distribution of time between events like in presentation above - maybe
> there is  something way smarter
> 
> Johannes
> 
> On Sat, Nov 11, 2017 at 2:50 AM, Pat Ferrel  wrote:
> 
>> BTW you should take time buckets that are relatively free of daily cycles
>> like 3 day, week, or month buckets for “hot”. This is to remove cyclical
>> effects from the frequencies as much as possible since you need 3 buckets
>> to see the change in change, 2 for the change, and 1 for the event
> volume.
>> 
>> 
>> On Nov 10, 2017, at 4:12 PM, Pat Ferrel  wrote:
>> 
>> So your idea is to find anomalies in event frequencies to detect “hot”
>> items?
>> 
>> Interesting, maybe Ted will chime in.
>> 
>> What I do is take the frequency, first, and second, derivatives as
>> measures of popularity, increasing popul

Re: "LLR with time"

2017-11-10 Thread Pat Ferrel
BTW you should take time buckets that are relatively free of daily cycles, like 
3-day, week, or month buckets for “hot”. This is to remove cyclical effects 
from the frequencies as much as possible, since you need 3 buckets to see the 
change in change, 2 for the change, and 1 for the event volume.


On Nov 10, 2017, at 4:12 PM, Pat Ferrel  wrote:

So your idea is to find anomalies in event frequencies to detect “hot” items?

Interesting, maybe Ted will chime in.

What I do is take the frequency and its first and second derivatives as measures of 
popularity, increasing popularity, and increasingly increasing popularity. Put 
another way: popular, trending, and hot. This is simple to do by taking 1, 2, or 
3 time buckets and looking at the number of events, the derivative (difference), 
and the second derivative. Ranking all items by these values gives various measures 
of popularity or its increase. 
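That bucket-and-derivative calculation is small enough to sketch directly (three consecutive time buckets per item, oldest first; item names and counts are made up):

```python
def popularity_measures(counts):
    """counts: event counts in 3 consecutive time buckets, oldest first.
    Returns (popular, trending, hot): volume, first difference (change),
    and second difference (change in change)."""
    c0, c1, c2 = counts
    popular = c2
    trending = c2 - c1
    hot = (c2 - c1) - (c1 - c0)
    return popular, trending, hot

items = {"a": [10, 20, 40], "b": [40, 40, 41]}
# "a" is accelerating (hot = 20 - 10 = 10); "b" is nearly flat (hot = 1)
hot_ranking = sorted(items, key=lambda i: popularity_measures(items[i])[2],
                     reverse=True)
```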

If your use is in a recommender you can add a ranking field to all items and 
query for “hot” by using the ranking you calculated. 

If you want to bias recommendations by hotness, query with user history and 
boost by your hot field. I suspect the hot field will tend to overwhelm your 
user history in this case, as it would if you used anomalies, so you’d also have 
to normalize the hotness to some range closer to the one created by the user 
history matching score. I haven’t found a very good way to mix these in a model, 
so use hot as a method of backfill if you cannot return enough recommendations, 
or in places where you may want to show just hot items. There are several 
benefits to this method of using hot to rank all items, including the fact that 
you can apply business rules to them just as with normal recommendations—so you 
can ask for hot in “electronics” if you know categories, or hot "in-stock" 
items, or ...

Still anomaly detection does sound like an interesting approach.


On Nov 10, 2017, at 3:13 PM, Johannes Schulte  
wrote:

Hi "all",

I am wondering what would be the best way to incorporate event time
information into the calculation of the G-Test.

There is a claim here
https://de.slideshare.net/tdunning/finding-changes-in-real-data

saying "Time aware variant of G-Test is possible"

I remember I experimented with exponentially decayed counts some years ago
and this involved changing the counts to doubles, but I suspect there is
some smarter way. What I don't get is the relation to a data structure like
T-Digest when working with a lot of counts / cells for every combination of
items. Keeping a t-digest for every combination seems unfeasible.
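For what it's worth, the "counts become doubles" version is tiny; a sketch of an exponentially decayed counter (half-life and time units are arbitrary here):

```python
import math

class DecayedCount:
    """Running count whose past contributions halve every `half_life`
    time units, so the 'count' is a float rather than an integer."""
    def __init__(self, half_life):
        self.rate = math.log(2.0) / half_life
        self.value = 0.0
        self.last_t = 0.0

    def add(self, t, weight=1.0):
        # decay the accumulated count forward to time t, then add the event
        self.value *= math.exp(-self.rate * (t - self.last_t))
        self.last_t = t
        self.value += weight

c = DecayedCount(half_life=10.0)
c.add(0.0)
c.add(10.0)  # one half-life later: the first event now contributes ~0.5
```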

How would one incorporate event time into recommendations to detect
"hotness" of certain relations? Glad if someone has an idea...

Cheers,

Johannes




Re: "LLR with time"

2017-11-10 Thread Pat Ferrel
So your idea is to find anomalies in event frequencies to detect “hot” items?

Interesting, maybe Ted will chime in.

What I do is take the frequency and its first and second derivatives as measures of 
popularity, increasing popularity, and increasingly increasing popularity. Put 
another way: popular, trending, and hot. This is simple to do by taking 1, 2, or 
3 time buckets and looking at the number of events, the derivative (difference), 
and the second derivative. Ranking all items by these values gives various measures 
of popularity or its increase. 

If your use is in a recommender you can add a ranking field to all items and 
query for “hot” by using the ranking you calculated. 

If you want to bias recommendations by hotness, query with user history and 
boost by your hot field. I suspect the hot field will tend to overwhelm your 
user history in this case, as it would if you used anomalies, so you’d also have 
to normalize the hotness to some range closer to the one created by the user 
history matching score. I haven’t found a very good way to mix these in a model, 
so use hot as a method of backfill if you cannot return enough recommendations, 
or in places where you may want to show just hot items. There are several 
benefits to this method of using hot to rank all items, including the fact that 
you can apply business rules to them just as with normal recommendations—so you 
can ask for hot in “electronics” if you know categories, or hot "in-stock" 
items, or ...

Still anomaly detection does sound like an interesting approach.

 
On Nov 10, 2017, at 3:13 PM, Johannes Schulte  
wrote:

Hi "all",

I am wondering what would be the best way to incorporate event time
information into the calculation of the G-Test.

There is a claim here
https://de.slideshare.net/tdunning/finding-changes-in-real-data

saying "Time aware variant of G-Test is possible"

I remember I experimented with exponentially decayed counts some years ago
and this involved changing the counts to doubles, but I suspect there is
some smarter way. What I don't get is the relation to a data structure like
T-Digest when working with a lot of counts / cells for every combination of
items. Keeping a t-digest for every combination seems unfeasible.

How would one incorporate event time into recommendations to detect
"hotness" of certain relations? Glad if someone has an idea...

Cheers,

Johannes



Re: Which template for predicting ratings?

2017-11-10 Thread Pat Ferrel
Any of the Spark MLlib ALS recommenders in the PIO template gallery support 
ratings.

However, I must warn that ratings are not very good for recommendations, and none 
of the big players use ratings anymore; Netflix doesn’t even display them. The 
reason is that your 2 may be my 3 or 4, and people rate different 
categories differently. For instance, Netflix found Comedies were rated lower 
than Independent films. There have been many solutions proposed and tried, but 
none have proven very helpful.

There is another, more fundamental problem: why would you want to recommend the 
highest rated item? What do you buy on Amazon or watch on Netflix? Are they 
only your highest rated items? Research has shown that they are not. There was 
a whole misguided movement around ratings that affected academic papers and 
cross-validation metrics that has been fairly well discredited. It all came 
from the Netflix prize, which used both. Netflix has since led the way in 
dropping ratings as they saw the things I have mentioned.

What do you do? Categorical indicators work best (like, dislike) or implicit 
indicators (buy) that are unambiguous. If a person buys something, they like 
it; if they rate it 3, do they like it? I buy many 3-rated items on Amazon if I 
need them. 

My advice is drop ratings and use thumbs up or down. These are unambiguous and 
the thumbs down can be used in some cases to predict thumbs up: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/ 

 This uses data from a public web site to show significant lift by using “like” 
and “dislike” in recommendations. This used the Universal Recommender.


On Nov 10, 2017, at 5:02 AM, Noelia Osés Fernández  wrote:


Hi all,

I'm new to PredictionIO so I apologise if this question is silly.

I have an application in which users are rating different items on a scale of 1 
to 5 stars. I want to recommend items to a new user and give her the predicted 
rating in number of stars. Which template should I use to do this? Note that I 
need the predicted rating to be in the same range of 1 to 5 stars.

Is it possible to do this with the ecommerce recommendation engine?

Thank you very much for your help!
Noelia








Re: PIO + ES5 + Universal Recommender

2017-11-08 Thread Pat Ferrel
“mvn not found”, install mvn. 

This step will go away with the next Mahout release.


On Nov 8, 2017, at 2:41 AM, Noelia Osés Fernández  wrote:

Thanks Pat!

I have followed the instructions on the README.md file of the mahout folder:


You will need to build this using Scala 2.11. Follow these instructions

 - install Scala 2.11 as your default version

I've done this with the following commands:

# scala install
wget www.scala-lang.org/files/archive/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb
# sbt installation
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get update
sudo apt-get install sbt

 - download this repo: `git clone https://github.com/actionml/mahout.git`
 - checkout the speedup branch: `git checkout sparse-speedup-13.0`
 - edit the build script `build-scala-2.11.sh` to put the custom repo where you want it

This file is now:

#!/usr/bin/env bash

git checkout sparse-speedup-13.0

mvn clean package -DskipTests -Phadoop2 -Dspark.version=2.1.1 
-Dscala.version=2.11.11 -Dscala.compat.version=2.11

echo "Make sure to put the custom repo in the right place for your machine!"
echo "This location will have to be put into the Universal Recommenders 
build.sbt"

mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/hdfs/target/mahout-hdfs-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-hdfs -Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/math/target/mahout-math-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-math -Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/math-scala/target/mahout-math-scala_2.11-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-math-scala_2.11 
-Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/spark/target/mahout-spark_2.11-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-spark_2.11 -Dversion=0.13.0

 - execute the build script `build-scala-2.11.sh`

This produced the following output:

$ ./build-scala-2.11.sh
M build-scala-2.11.sh
Already on 'sparse-speedup-13.0'
Your branch is up-to-date with 'origin/sparse-speedup-13.0'.
./build-scala-2.11.sh: line 5: mvn: command not found
Make sure to put the custom repo in the right place for your machine!
This location will have to be put into the Universal Recommenders build.sbt
./build-scala-2.11.sh: line 10: mvn: command not found
./build-scala-2.11.sh: line 11: mvn: command not found
./build-scala-2.11.sh: line 12: mvn: command not found
./build-scala-2.11.sh: line 13: mvn: command not found


Do I need to install Maven? If so, it is not mentioned in the PredictionIO 
installation instructions nor in the Mahout instructions. 

I apologise if this is an obvious question for those familiar with the Apache 
projects, but for an outsider like me it helps when everything (even the most 
silly details) is spelled out. Thanks a lot for all your invaluable help!!
 

On 7 November 2017 at 20:58, Pat Ferrel <p...@occamsmachete.com> wrote:
Very sorry, it was incorrectly set to private. Try it again.




On Nov 7, 2017, at 7:26 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

https://github.com/actionml/mahout






Re: PIO + ES5 + Universal Recommender

2017-11-07 Thread Pat Ferrel
Very sorry, it was incorrectly set to private. Try it again.




On Nov 7, 2017, at 7:26 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

https://github.com/actionml/mahout




Re: PIO + ES5 + Universal Recommender

2017-11-07 Thread Pat Ferrel
Very sorry, it was incorrectly set to private. Try it again.



On Nov 7, 2017, at 12:52 AM, Noelia Osés Fernández  wrote:

Thank you, Pat!

I have a problem with the Mahout repo, though. I get the following error 
message:

remote: Repository not found.
fatal: repository 'https://github.com/actionml/mahout.git/' not found


On 3 November 2017 at 22:27, Pat Ferrel <p...@occamsmachete.com> wrote:
The exclusion rules are working now along with the integration-test. We have 
some cleanup but please feel free to try it.

Please note the upgrade issues mentioned below before you start, fresh installs 
should have no such issues.


On Nov 1, 2017, at 4:30 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Ack, I hate this &^%&%^&  touchbar!

What I meant to say was:


We have a version of the universal recommender working with PIO-0.12.0 that is 
ready for brave souls to test. This includes some speedups and quality of 
recommendation improvements, not yet documented. 

Known bugs: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the integration test, Lucene and ES have changed their 
scoring method and so you cannot compare the old scores to the new ones. The 
test will be fixed before release but do trust it to populate PIO with some 
sample data you can play with.

You must build PredictionIO with the default parameters, so just run 
`./make-distribution`. This will require you to install Scala 2.11, Spark 2.1.1 
or greater, ES 5.5.2 or greater, and Hadoop 2.6 or greater. If you have issues 
getting pio to build and run, send questions to the PIO mailing list. Once PIO 
is running, test with `pio status` and `pio app list`. You will need to create 
an app and import your data to run the integration test to get some sample data 
installed in the “handmade” app.

*Back up your data*: moving from ES 1 to ES 5 will delete all data. Actually, 
even worse, it is still in HBase but you can’t get at it, so to upgrade do the 
following:
`pio export` with pio < 0.12.0 =*Before upgrade!*=
`pio data-delete` all your old apps =*Before upgrade!*=
build and install pio 0.12.0 including all the services =*The point of no 
return!*=
`pio app new …` and `pio import…` any needed datasets
download and build Mahout for Scala 2.11 from this repo: 
https://github.com/actionml/mahout.git 
and follow the instructions in the README.md
download the UR from here: 
https://github.com/actionml/universal-recommender.git 
and checkout branch 0.7.0-SNAPSHOT
replace the line `resolvers += "Local Repository" at 
"file:///Users/pat/.custom-scala-m2/repo"` with your path to the local 
mahout build
build the UR with `pio build`, or run the integration test to get sample data 
put into PIO: `./examples/integration-test`

This will use the released PIO and alpha UR

This will be much easier when it’s released






Re: Implementing cart and wishlist item events into Ecommerce recommendation template

2017-11-04 Thread Pat Ferrel
Oh, forgot to say the most important part. The ECom recommender does not 
support shopping carts unless you train on (cart-id, item-id-of-item-added-to-cart), 
and even then I'm not sure you can query with the current cart's contents, 
since the item-based query is for a single item. The cart-id takes the place 
of the user-id in this method of training. There may be a way to do this in 
the MLlib implementation, but it isn't surfaced in the PIO interface. The cart 
would be treated as an anonymous user (one not in the training data) and would 
take an item list in the query. Look into the MLlib ALS library and expect to 
modify the template code.
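To make the training scheme concrete, here is a sketch of what an event sent to the PIO event server might look like when the cart id stands in for the user id. The ids (`cart-9`, `sku-1`) are made up; the field names follow the standard PIO event API:

```python
import json

# Build a PIO-style event where the cart id plays the "user" role, so
# co-occurrence is computed per cart rather than per user.
def cart_event(cart_id, item_id):
    return {
        "event": "add-to-cart",      # event name used in training
        "entityType": "user",        # the cart id stands in for the user id
        "entityId": cart_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
    }

# Serialize for POSTing to the event server
payload = json.dumps(cart_event("cart-9", "sku-1"))
print(payload)
```

In a real deployment this JSON would be POSTed to the event server's `/events.json` endpoint with your app's access key.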

There is also the Complementary Purchase template, which does shopping carts, 
but, from my rather prejudiced viewpoint, if you need to switch templates, use 
one that supports every use case you are likely to need.


On Nov 4, 2017, at 9:34 AM, Pat Ferrel  wrote:

The Universal Recommender supports several types of “item-set” recommendations:
1) Complementary purchases: things bought along with what you have in the 
shopping cart. This is done by training on (cart-id, “add-to-cart”, item-id) 
and querying with the current items in the user’s cart. 
2) Items similar to those in the cart: this is done by training with the 
typical events like purchase, detail-view, add-to-cart, etc. for each user, 
then querying with the contents of the shopping cart as an “item-set”. This 
gives things similar to what is in the cart, which is usually not the precise 
semantics for a shopping cart but fits other uses of an item-set, like wish-lists.
3) Take the last n items viewed and query with them, and you have 
“recommendations based on your recent views”. In this case you need purchases 
as the primary event, because you want to recommend purchases, but the query 
uses only “detail-views”. 
4) Some other combinations, like favorites, watch-lists, etc.

These work slightly differently, and I could give examples of how they are 
used on Amazon, but #1 is typically used for the “shopping cart”.
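As a sketch, an item-set query of the kind described in #1 and #2 might be built like this. The field name `itemSet` is an assumption based on the UR's documented query format; check the docs for your UR version, and the item ids are placeholders:

```python
import json

# Sketch of a Universal Recommender item-set query: recommend items
# related to the whole set (e.g. the current shopping-cart contents).
def item_set_query(items, num=10):
    return {
        "itemSet": items,  # field name assumed from UR docs; verify per version
        "num": num,        # how many recommendations to return
    }

# Query with the current cart contents
query = item_set_query(["sku-1", "sku-2"], num=4)
print(json.dumps(query))
```

The resulting JSON would be POSTed to the deployed engine's `/queries.json` endpoint.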


On Nov 3, 2017, at 7:13 PM, ilker burak <ilkerbu...@gmail.com> wrote:

Hi Vaghan,
I will check that. Thanks for your help and quick answer about this.

On Fri, Nov 3, 2017 at 8:02 AM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
Hey there, 

Did you see this: 
https://predictionio.incubator.apache.org/templates/ecommercerecommendation/train-with-rate-event/

To handle such events you may want to use the $set events as shown in the 
template documentation. I use the Universal Recommender, though, since it 
already supports these requirements. 


Hope this helps. 

On Fri, Nov 3, 2017 at 10:37 AM, ilker burak <ilkerbu...@gmail.com> wrote:
Hello,
I am using the Ecommerce Recommendation template. Currently I import view and 
buy events and it works. To improve result accuracy, how can I modify the code 
to import and use events like 'user added item to cart' and 'user added item 
to wishlist'? I know this template supports adding new events, but the only 
example on the site shows how to implement the rate event, and I am not using 
rate data.
Thank you

Ilker







Re: Implementing cart and wishlist item events into Ecommerce recommendation template

2017-11-04 Thread Pat Ferrel
The Universal Recommender supports several types of “item-set” recommendations:
1) Complementary purchases: things bought along with what you have in the 
shopping cart. This is done by training on (cart-id, “add-to-cart”, item-id) 
and querying with the current items in the user’s cart. 
2) Items similar to those in the cart: this is done by training with the 
typical events like purchase, detail-view, add-to-cart, etc. for each user, 
then querying with the contents of the shopping cart as an “item-set”. This 
gives things similar to what is in the cart, which is usually not the precise 
semantics for a shopping cart but fits other uses of an item-set, like wish-lists.
3) Take the last n items viewed and query with them, and you have 
“recommendations based on your recent views”. In this case you need purchases 
as the primary event, because you want to recommend purchases, but the query 
uses only “detail-views”. 
4) Some other combinations, like favorites, watch-lists, etc.

These work slightly differently, and I could give examples of how they are 
used on Amazon, but #1 is typically used for the “shopping cart”.


On Nov 3, 2017, at 7:13 PM, ilker burak  wrote:

Hi Vaghan,
I will check that. Thanks for your help and quick answer about this.

On Fri, Nov 3, 2017 at 8:02 AM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
Hey there, 

Did you see this: 
https://predictionio.incubator.apache.org/templates/ecommercerecommendation/train-with-rate-event/

To handle such events you may want to use the $set events as shown in the 
template documentation. I use the Universal Recommender, though, since it 
already supports these requirements. 


Hope this helps. 

On Fri, Nov 3, 2017 at 10:37 AM, ilker burak <ilkerbu...@gmail.com> wrote:
Hello,
I am using the Ecommerce Recommendation template. Currently I import view and 
buy events and it works. To improve result accuracy, how can I modify the code 
to import and use events like 'user added item to cart' and 'user added item 
to wishlist'? I know this template supports adding new events, but the only 
example on the site shows how to implement the rate event, and I am not using 
rate data.
Thank you

Ilker





Re: PIO + ES5 + Universal Recommender

2017-11-03 Thread Pat Ferrel
The exclusion rules are working now, along with the integration test. We have 
some cleanup to do, but please feel free to try it.

Please note the upgrade issues mentioned below before you start, fresh installs 
should have no such issues.


On Nov 1, 2017, at 4:30 PM, Pat Ferrel  wrote:

Ack, I hate this &^%&%^&  touchbar!

What I meant to say was:


We have a version of the universal recommender working with PIO-0.12.0 that is 
ready for brave souls to test. This includes some speedups and quality of 
recommendation improvements, not yet documented. 

Known bugs: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the scores from the integration test; Lucene and ES have 
changed their scoring method, so you cannot compare the old scores to the new 
ones. The test will be fixed before release, but do trust it to populate PIO 
with some sample data you can play with.

You must build PredictionIO with the default parameters, so just run 
`./make-distribution`. This will require you to install Scala 2.11, Spark 2.1.1 
or greater, ES 5.5.2 or greater, and Hadoop 2.6 or greater. If you have issues 
getting PIO to build and run, send questions to the PIO mailing list. Once PIO 
is running, test with `pio status` and `pio app list`. You will need to create 
an app and import your data; run the integration test to get some sample data 
installed in the “handmade” app.

*Backup your data*: moving from ES 1 to ES 5 will delete all data. Actually, 
even worse, it is still in HBase but you can’t get at it. So to upgrade, do 
the following:
`pio export` with pio < 0.12.0 =*Before upgrade!*=
`pio data-delete` all your old apps =*Before upgrade!*=
build and install pio 0.12.0 including all the services =*The point of no 
return!*=
`pio app new …` and `pio import …` any needed datasets
download and build Mahout for Scala 2.11 from this repo: 
https://github.com/actionml/mahout.git and follow the instructions in the 
README.md
download the UR from here: 
https://github.com/actionml/universal-recommender.git and checkout branch 
0.7.0-SNAPSHOT
replace the line `resolvers += "Local Repository" at 
"file:///Users/pat/.custom-scala-m2/repo"` with your path to the local 
Mahout build
build the UR with `pio build`, or run the integration test 
(`./examples/integration-test`) to get sample data put into PIO
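Strung together, the upgrade steps above might look roughly like this. App names, ids, and backup paths are placeholders, not exact commands from the docs:

```shell
# Before the upgrade, with the OLD pio (< 0.12.0):
pio export --appid 1 --output /backups/handmade    # one export per app
pio app data-delete handmade                       # delete each old app's data

# The point of no return: install PIO 0.12.0 and all its services, then:
pio app new handmade
pio import --appid 1 --input /backups/handmade

# Build Mahout for Scala 2.11, then the UR snapshot branch:
git clone https://github.com/actionml/mahout.git        # build per its README.md
git clone https://github.com/actionml/universal-recommender.git
cd universal-recommender
git checkout 0.7.0-SNAPSHOT
pio build          # or run ./examples/integration-test for sample data
```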

This will use the released PIO and alpha UR

This will be much easier when it’s released




Re: PIO + ES5 + Universal Recommender

2017-11-01 Thread Pat Ferrel
Ack, I hate this &^%&%^&  touchbar!

What I meant to say was:


We have a version of the universal recommender working with PIO-0.12.0 that is 
ready for brave souls to test. This includes some speedups and quality of 
recommendation improvements, not yet documented. 

Known bugs: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the scores from the integration test; Lucene and ES have 
changed their scoring method, so you cannot compare the old scores to the new 
ones. The test will be fixed before release, but do trust it to populate PIO 
with some sample data you can play with.

You must build PredictionIO with the default parameters, so just run 
`./make-distribution`. This will require you to install Scala 2.11, Spark 2.1.1 
or greater, ES 5.5.2 or greater, and Hadoop 2.6 or greater. If you have issues 
getting PIO to build and run, send questions to the PIO mailing list. Once PIO 
is running, test with `pio status` and `pio app list`. You will need to create 
an app and import your data; run the integration test to get some sample data 
installed in the “handmade” app.

*Backup your data*: moving from ES 1 to ES 5 will delete all data. Actually, 
even worse, it is still in HBase but you can’t get at it. So to upgrade, do 
the following:
`pio export` with pio < 0.12.0 =*Before upgrade!*=
`pio data-delete` all your old apps =*Before upgrade!*=
build and install pio 0.12.0 including all the services =*The point of no 
return!*=
`pio app new …` and `pio import …` any needed datasets
download and build Mahout for Scala 2.11 from this repo: 
https://github.com/actionml/mahout.git and follow the instructions in the 
README.md
download the UR from here: 
https://github.com/actionml/universal-recommender.git and checkout branch 
0.7.0-SNAPSHOT
replace the line `resolvers += "Local Repository" at 
"file:///Users/pat/.custom-scala-m2/repo"` with your path to the local 
Mahout build
build the UR with `pio build`, or run the integration test 
(`./examples/integration-test`) to get sample data put into PIO

This will use the released PIO and alpha UR

This will be much easier when it’s released

PIO + ES5 + Universal Recommender

2017-11-01 Thread Pat Ferrel
We have a version working here: 
https://github.com/actionml/universal-recommender.git 

checkout 0.7.0-SNAPSHOT once you pull the repo. 

Known bug: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the scores from the integration test; Lucene and ES have 
changed their scoring method, so you cannot compare the old scores to the new 
ones. The test will be fixed before release.

You must build the Template with pio v0.12.0 using Scala 2.11, Spark 2.2.1, ES 
5.

[jira] [Comment Edited] (MAHOUT-2020) Maven repo structure malformed

2017-11-01 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234411#comment-16234411
 ] 

Pat Ferrel edited comment on MAHOUT-2020 at 11/1/17 5:22 PM:
-

Nothing to do with SBT; look in the parent POM, it always has 
{scala.compat.version} = 2.10, and the rest of the child POMs inherit that 
even if they were built for Scala 2.11


was (Author: pferrel):
Nothing to do with SBT; look in the parent POM, it always has 
{scala.compat.version} = 2.10, the rest of the child POMs inherit that, and 
it is wrong when building for Scala 2.11

> Maven repo structure malformed
> --
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo is built with scala 2.10 always in the parent pom's 
> {scala.compat.version} even when you only ask for Scala 2.11, this leads to 
> the 2.11 jars never being found. 
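As a sketch, the fix described above amounts to making the parent POM's property follow the Scala profile being built instead of staying pinned at 2.10. The profile wiring below is illustrative, not the actual Mahout POM:

```xml
<!-- Parent pom.xml: scala.compat.version must track the Scala profile
     being built, not stay hard-coded at 2.10. -->
<profiles>
  <profile>
    <id>scala-2.11</id>
    <properties>
      <scala.compat.version>2.11</scala.compat.version>
      <scala.version>2.11.11</scala.version>
    </properties>
  </profile>
</profiles>
```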



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

