Re: Unsubscribe

2023-12-05 Thread Pat Ferrel
There is no instruction for "issues@mahout.apache.org". There are instructions 
for user, dev, and commits, but I've been getting email from lots of other ASF 
lists; some are Mahout, like issues@, others are not, and who knows how to get 
off those.

I will assume the magic here is to construct the address 
issues-unsubscr...@mahout.apache.org and send to it, but why is this necessary? 
Why do ASF lists NOT allow an "unsubscribe" subject to start the unsubscribe 
handshake? How much wasted time would be saved on both sides if infra fixed 
this globally?

Thanks for listening to one more of my rants. Happy Holidays
:-)


> On Dec 5, 2023, at 12:04 PM, Andrew Musselman  wrote:
> 
> Here's how to do it: https://mahout.apache.org/community/mailing-lists.html
> 
> -- Forwarded message -
> From: Pat Ferrel <p...@occamsmachete.com>
> Date: Tue, Dec 5, 2023 at 11:58 AM
> Subject: Unsubscribe
> To: issues@mahout.apache.org
> 
> 
> Unsubscribe



Unsubscribe

2023-12-05 Thread Pat Ferrel
Unsubscribe


[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2020-10-20 Thread Pat Ferrel (Jira)


[ https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217820#comment-17217820 ]

Pat Ferrel commented on MAHOUT-2023:


I don't install Mahout as a shell process, and this only occurs when trying to 
use the CLI, so I don't have a good way to test.

At the time this was observed, the CLI was by far the most common way to use 
Mahout.

These days the CLI may not need to be supported, since more robust notebook and 
REPL solutions exist. Personally I would have no problem with removing support 
for the CLI.

-- BUT --

This would necessitate rewriting much of the Mahout documentation for 
recommenders, and I'm not willing to tackle that since publishing the site is, 
AFAIK, blocked in some way.

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 14.2
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> This is probably a build issue related to an incorrect version of scopt being 
> looked for at runtime.
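For context, the Spark drivers parse their command-line options with scopt, so a missing scopt jar is fatal before any work starts. A hypothetical sketch of that pattern (not Mahout's actual driver code):

object ItemSimilarityCliSketch {
  // Hypothetical options; real drivers have many more.
  case class Config(input: String = "", output: String = "")

  def main(args: Array[String]): Unit = {
    // If scopt.OptionParser is missing from the runtime classpath, this is
    // roughly where the CLI dies with a fatal class-loading error.
    val parser = new scopt.OptionParser[Config]("spark-itemsimilarity") {
      opt[String]('i', "input").required().action((x, c) => c.copy(input = x))
      opt[String]('o', "output").required().action((x, c) => c.copy(output = x))
    }
    parser.parse(args, Config()) match {
      case Some(cfg) => println(s"would run with ${cfg.input} -> ${cfg.output}")
      case None      => sys.exit(1)
    }
  }
}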



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Release 14.1, RC7

2020-09-30 Thread Pat Ferrel
Still haven't had a chance to test since it will take some experimentation
to figure out the jars needed, etc. My test is to replace 0.13 with 0.14.1.

Still, I see no reason to delay the release for my slow testing.

+1


From: Andrew Musselman 

Reply: dev@mahout.apache.org  
Date: September 28, 2020 at 7:31:42 AM
To: Mahout Dev List  
Subject:  Re: [DISCUSS] Release 14.1, RC7

Thanks very much Andy!

On Sun, Sep 27, 2020 at 11:38 PM Andrew Palumbo  wrote:

> All,
>
> Apologies on holding this up a bit; I told Andrew 2x that I was in process
> of testing and 2x got pulled away. I am +1.
>
> Re: Jake's comments on dev@, I think if we focus on documentation in the
> next release, we can get things clear.
>
> 
>
> From: Andrew Palumbo 
> Sent: Wednesday, September 23, 2020 9:29 PM
> To: priv...@mahout.apache.org 
> Subject: Re: [DISCUSS] Release 14.1, RC7
>
> I have a minute tonight, I will test and vote.
>
> On Sep 23, 2020 8:47 AM, Andrew Musselman  wrote:
>
> Just a heads up to the Mahout PMC; we have a few beloved lurkers on the
> committee who I would love to see at least some release votes from.
>
> If anyone wants to do a quick screen share to get your current work machine
> up and running with this release candidate I am happy to spend time with
> you. Verifying a release is an hour of time from start to finish, and can
> be less after you're set up.
>
> Thanks for considering it!
>
> Best
> Andrew
>
> On Wed, Sep 23, 2020 at 7:44 AM Trevor Grant  wrote:
>
> > I'm back- will test tonight I hope.
> >
> > Pat can give a binding, and knows the most about the SBT- so I'd like to
> > see a +1 from him (or -1 if it doesn't work).
> >
> > I have a binding.
> >
> > Implicitly, AKM would have a binding +1, but the release master normally
> > doesn't vote until the end.
> >
> > So that would be 3, but it would be worth exploring a new PMC addition.
> >
> > On Tue, Sep 22, 2020 at 2:25 AM Christofer Dutz <christofer.d...@c-ware.de>
> > wrote:
> >
> > > Hi all,
> > >
> > > It's been 11 days now and so far I can only see 1 non-binding vote … I
> > > know that Trevor is on vacation at the moment, but what's up with the
> > > others?
> > >
> > > And I had a chat with Pat on slack about the SBT thing … I think we
> > > should discuss and whip up a how-to for SBT and Scala users as soon as
> > > we have the release out the door.
> > >
> > > Chris


Re: [DISCUSS] Dissolve Apache PredictionIO PMC and move project to the Attic

2020-08-31 Thread Pat Ferrel
To try to keep this on-subject, I'll say that I've been working on what I once 
saw as a next-gen PIO. It is ASL 2, and has 2 engines that ran in PIO — most 
notably the Universal Recommender. We offered to make the Harness project part 
of PIO a couple of years back but didn't get much interest. It is now at 
v0.6.0-SNAPSHOT. The key difference is that it is designed for the user, rather 
than the Data Scientist.

Check Harness out: https://github.com/actionml/harness Contributors are 
welcome. 

We owe everything to PIO where we proved it could be done.



From: Donald Szeto 
Reply: user@predictionio.apache.org 
Date: August 29, 2020 at 3:45:04 PM
To: d...@predictionio.apache.org 
Cc: user@predictionio.apache.org 
Subject:  Re: [DISCUSS] Dissolve Apache PredictionIO PMC and move project to 
the Attic  

It looks like there is no objection. I will start a vote shortly.

Regards,
Donald

On Mon, Aug 24, 2020 at 1:17 PM Donald Szeto  wrote:
Hi all,

The Apache PredictionIO project had an amazing ride back in its early years. 
Unfortunately, its momentum has declined and its core technology has fallen 
behind. Although we have received some appeals from the community to help bring 
the project up to speed, the effort has not been sufficient.

I think it is about time to archive the project. The proper way to do so is to 
follow the Apache Attic process documented at 
http://attic.apache.org/process.html. This discussion thread is the first step. 
If there is no objection, it will be followed by a voting thread.

Existing users: This move should not impact existing functionality, as the 
source code will still be available through the Apache Attic, in a read-only 
state.

Thank you for your continued support over the years. The project would not be 
possible without your help.

Regards,
Donald

Re: [ANNOUNCE] Mahout Con 2020 (A sub-track of ApacheCon @ Home)

2020-08-12 Thread Pat Ferrel
Big fun. Thanks for putting this together.

I’ll abuse my few Twitter followers with the announcement.


From: Trevor Grant 
Reply: user@mahout.apache.org 
Date: August 12, 2020 at 5:59:45 AM
To: Mahout Dev List , user@mahout.apache.org 

Subject:  [ANNOUNCE] Mahout Con 2020 (A sub-track of ApacheCon @ Home)  

Hey all,  

We got enough people to volunteer for talks that we are going to be putting  
on our very own track at ApacheCon (@Home) this year!  

Check out the schedule here:  
https://www.apachecon.com/acna2020/tracks/mahout.html  

To see the talks live / in real time, please register at:  
https://hopin.to/events/apachecon-home  

But if you can't make it- we plan on pushing all of the recorded sessions  
to the website after.  

Thanks so much everyone, and can't wait to 'see' you there!  

tg  


Memory allocation

2020-04-17 Thread Pat Ferrel
I have used Spark for several years and realize from recent chatter on this 
list that I don’t really understand how it uses memory.

Specifically: are spark.executor.memory and spark.driver.memory taken from the 
JVM heap? When does Spark take memory from the JVM heap, and when does it come 
from off-heap?

Since spark.executor.memory and spark.driver.memory are job params, I have 
always assumed that the required memory was off-JVM-heap.  Or am I on the wrong 
track altogether?
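For reference, a minimal sketch of setting these knobs per job through SparkConf (values are hypothetical). Both settings size JVM heaps; off-heap use is a separate, opt-in pool:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical sizes. spark.driver.memory and spark.executor.memory request
// JVM heap for the driver and for each executor; off-heap storage is a
// separate, opt-in pool. Note that in client mode the driver JVM is already
// running, so its heap normally has to be set at launch time (for example
// via spark-submit) rather than here.
val conf = new SparkConf()
  .setAppName("memory-sizing-sketch")
  .set("spark.driver.memory", "4g")
  .set("spark.executor.memory", "8g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")

val spark = SparkSession.builder().config(conf).getOrCreate()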

Can someone point me to a discussion of this?

thanks

Re: IDE suitable for Spark

2020-04-07 Thread Pat Ferrel
IntelliJ Scala works well when debugging master=local. Has anyone used it for 
remote/cluster debugging? I’ve heard it is possible...
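One hedged way to attempt this is to have the driver (or an executor) JVM listen with a standard JDWP agent and attach IntelliJ's Remote JVM Debug configuration to it; a minimal sketch with hypothetical ports:

import org.apache.spark.SparkConf

// Hypothetical sketch: these options take effect when the corresponding JVM
// is launched, so for a driver in cluster mode set them at submit time.
// Attach IntelliJ's "Remote JVM Debug" run configuration to host:5005.
val conf = new SparkConf()
  .setAppName("remote-debug-sketch")
  .set("spark.driver.extraJavaOptions",
    "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005")
  .set("spark.executor.extraJavaOptions",
    "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006")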


From: Luiz Camargo 
Reply: Luiz Camargo 
Date: April 7, 2020 at 10:26:35 AM
To: Dennis Suhari 
Cc: yeikel valdes , zahidr1...@gmail.com 
, user@spark.apache.org 
Subject:  Re: IDE suitable for Spark  

I have used IntelliJ Spark/Scala with the sbt tool

On Tue, Apr 7, 2020 at 1:18 PM Dennis Suhari  
wrote:
We are using Pycharm resp. R Studio with Spark libraries to submit Spark Jobs. 

Sent from my iPhone

On 07.04.2020 at 18:10, yeikel valdes wrote:



Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it is 
missing a lot of the features that we expect from an IDE.

Thanks for sharing though. 

 On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote 

When I first logged on I asked if there was a suitable IDE for Spark.
I did get a couple of responses.  
Thanks.  

I did actually find one which is suitable IDE for spark.  
That is  Apache Zeppelin.

One of many reasons it is suitable for Apache Spark is the up-and-running 
stage: type bin/zeppelin-daemon.sh start, go to a browser, and open 
http://localhost:8080.
That's it!

Then, to hit the ground running, there are ready-to-go Apache Spark examples 
showing off the type of functionality one will be using in real-life production.

Zeppelin comes with embedded Apache Spark and Scala as the default interpreter, 
plus 20+ other interpreters.
I have gone on to discover a number of other advantages for a real-time 
production environment with Zeppelin, offered up by other Apache products.

Backbutton.co.uk
¯\_(ツ)_/¯  
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



--  


Prof. Luiz Camargo
Educator - Computing



Re: PredictionIO ASF Board Report for Mar 2020

2020-03-19 Thread Pat Ferrel
PredictionIO is scalable BY SCALING ITS SUB-SERVICES. Running on a single
machine sounds like no scaling has been executed or even planned.

How do you scale ANY system?
1) vertical scaling: make the instance larger with more cores, more disk,
and most importantly more memory. Increase whatever resource you need most
but all will be affected eventually.
2) horizontal scaling: move each service to its own instance. Move the DB, Spark,
etc. (depending on what you are using). Then you can scale the sub-services (the
ones PIO uses) independently as needed.

Without a scaling plan you must trim your data to fit the system you have;
for instance, save only a few months of data. Unfortunately PIO has no
automatic way to do this, like a TTL. We created a template that you can
run to trim your DB by dropping old data. Unfortunately we have not kept up
with PIO versions, since we have moved to another ML server that DOES have
TTLs.

If anyone wants to upgrade the template it was last used with PIO 0.12.x
and is here: https://github.com/actionml/db-cleaner

If you continually add data to a bucket it will eventually overflow; how
could it be any other way?
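For illustration only, a rough sketch of the idea behind such a cleanup job, assuming the PIO Scala API (PEventStore) and a hypothetical app name; the db-cleaner template linked above is the actual reference:

import org.apache.predictionio.data.store.PEventStore
import org.apache.spark.SparkContext
import org.joda.time.DateTime

// Hypothetical sketch: find events older than a cutoff so they can be dropped
// (for example via the event server's REST API, or by rewriting the kept
// events into a fresh app). Deletion itself is left out here.
def findStaleEvents(sc: SparkContext, appName: String, keepMonths: Int) = {
  val cutoff = DateTime.now().minusMonths(keepMonths)
  // untilTime limits the result to events with eventTime before the cutoff
  PEventStore.find(appName = appName, untilTime = Some(cutoff))(sc)
}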



From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: March 19, 2020 at 7:43:08 AM
To: user@predictionio.apache.org 

Subject:  Re: PredictionIO ASF Board Report for Mar 2020

Hello!

My knowledge of PredictionIO is limited. I was able to set up a
PredictionIO server and run two templates on it, the recommendation and
similar item templates. The server is in production in my company and we
were having good results. Suddenly, as we fed data to the server, our
cloud machine's storage filled up and we can't add new data anymore, nor can
we process this data. An error message on Ubuntu states: "No space left on
device".

I am deploying this server on a single machine without any cluster or the
help of Docker. Do you have any suggestions to solve this issue? Also, is
there a way to clean the machine of old data?

As a final note, my knowledge of the data engineering and machine learning
fields is limited. I understand Scala and can work with Spark. However, I am
willing to dig deeper into PredictionIO. Do you think there is a way I can
contribute to the community in one way or another? Or are you just looking
for true experts in order to avoid moving the project to the Attic?

Regards
Sami Serbey
--
*From:* Donald Szeto 
*Sent:* Tuesday, March 10, 2020 8:26 PM
*To:* user@predictionio.apache.org ;
d...@predictionio.apache.org 
*Subject:* PredictionIO ASF Board Report for Mar 2020

Hi all,

Please take a look at the draft report below and make your comments or
edits as you see fit. The draft will be submitted on Mar 11, 2020.

Regards,
Donald

## Description:
The mission of Apache PredictionIO is the creation and maintenance of software
related to a machine learning server built on top of a state-of-the-art open
source stack that enables developers to manage and deploy production-ready
predictive services for various kinds of machine learning tasks.

## Issues:
Update: A community member, who's a committer and PMC of another Apache
project, has expressed interest in helping. The member has been engaged and
we are waiting for actions from that member.

Last report: No PMC chair nominee had been put forward a week after the PMC
chair expressed the intention to resign from the chair on the PMC mailing list.

## Membership Data:
Apache PredictionIO was founded 2017-10-17 (2 years ago)
There are currently 29 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 8:7.

Community changes, past quarter:
- No new PMC members. Last addition was Andrew Kyle Purtell on 2017-10-17.
- No new committers were added.

## Project Activity:
Sparse activities only on mailing list.

Recent releases:

0.14.0 was released on 2019-03-11.
0.13.0 was released on 2018-09-20.
0.12.1 was released on 2018-03-11.

## Community Health:
Update: A community member, who's a committer and PMC of another Apache
project, has expressed interest in helping. The member has been engaged and
we are waiting for actions from that member to see if a nomination to PMC
and chair would be appropriate.

Last report: We are seeking new leadership for the project at the moment to
bring it out
of maintenance mode. Moving to the attic would be the last option.


Re: Livy on Kubernetes support

2020-01-14 Thread Pat Ferrel
+1 from another user, FWIW. We also have Livy containers and Helm charts. The 
real problem is deploying a Spark cluster in k8s; we know of no working images 
for this. The Spark team seems focused on deploying jobs with k8s, which is 
fine but is not enough. We need to deploy Spark itself. We created our own 
containers and charts for this too.

Is anyone interested in sharing images that work with k8s for Livy and/or 
Spark? Ours are all ASF licensed OSS.

From: Marco Gaido 
Reply: dev@livy.incubator.apache.org 
Date: January 14, 2020 at 2:35:34 PM
To: dev@livy.incubator.apache.org 
Subject:  Re: Livy on Kubernetes support  

Hi Aliaksandr,  

thanks for your email and your work on this feature. As I mentioned to you
in the PR, I agree with you on the usefulness of this feature and you have
a big +1 from me for having it in Livy. Unfortunately, it is not my area of
expertise, so I don't feel confident merging it without other reviewers
taking a careful look at it.

For the future, I think a better approach would be to first discuss and
define the architecture with the community, so that it is shared and
accepted by the whole community before the PR is out. This also helps
get people involved and makes it easier for them to review
the PR. Anyway, after you have split the PRs, I think they are reasonable
and we can discuss them.

Looking forward to having your contribution in Livy.

Thanks,  
Marco  

On Tue, Jan 14, 2020 at 12:48, Aliaksandr Sasnouskikh <jahstreetl...@gmail.com> wrote:

> Hi community,  
>  
> About a year ago I've started to work on the patch to Apache Livy for Spark  
> on Kubernetes support in the scope of the project I've been working on.  
> Since that time I've created a PR  
> https://github.com/apache/incubator-livy/pull/167 which have already been  
> discussed and reviewed a lot. After finalizing the work in the result of  
> the PR discussions I've started to split the work introduced in the base PR  
> into smaller pieces to make it easier to separate the core and aux  
> functionality, and as a result - easier to review and merge. The first core  
> PR is https://github.com/apache/incubator-livy/pull/249.  
>  
> Also I've created the repos with Docker images (  
> https://github.com/jahstreet/spark-on-kubernetes-docker) and Helm charts (  
> https://github.com/jahstreet/spark-on-kubernetes-helm) with the possible  
> stack the users may want to use Livy on Kubernetes with, which potentially  
> in the future can be partially moved to Livy repo to keep the artifacts  
> required to run Livy on Kubernetes in a single place.  
>  
> Until now I've received positive feedback from more than 10 projects  
> about the usage of the patch. Several of them could be found in the  
> discussions of the base PR. Also my repos supporting this feature have  
> around 35 stars and 15 forks in total and were referenced in Spark related  
> Stackoverflow and Kubernetes slack channel discussions. So the users use it  
> already.  
>  
> You may think "What this guy wants from us then!?"... Well, I would like to  
> ask for your time and expertise to help push it forward and ideally make it  
> merged.  
>  
> Probably before I started coding I should have checked with the  
> contributors if this feature may have value for the project and how best
> to implement it, but I hope it is never too late ;) So I'm here to
> share with you the thoughts behind it.
>  
> The idea of Livy on Kubernetes is simply to replicate the logic it has for  
> Yarn API to Kubernetes API, which can be easily done since the interfaces  
> for the Yarn API are really similar to the ones of the Kubernetes.  
> Nevertheless this easy-to-do patch opens Livy the doors to Kubernetes which  
> seems to be really useful for the community taking into account the  
> popularity of Kubernetes itself and the latest releases of Spark supporting  
> Kubernetes as well.  
>  
> Proposed Livy job submission flow:  
> - Generate appTag and add  
> `spark.kubernetes.[driver/executor].label.SPARK_APP_TAG_LABEL` to Spark  
> config  
> - Run spark-submit in cluster-mode with Kubernetes master  
> - Start monitoring thread which resolves Spark Driver and Executor Pods  
> using the `SPARK_APP_TAG_LABEL`s assigned during the job submission  
> - Create additional Kubernetes resource if necessary: Spark UI service,  
> Ingress, CRDs, etc.  
> - Fetch Spark Pods statuses, Driver logs and other diagnostics information  
> while Spark Pods are running  
> - Remove Spark job resources (completed/failed Driver Pod, Service,  
> ConfigMap, etc.) from the cluster after the job completion/failure after  
> the configured timeout  
>  
> The core functionality (covered by  
> https://github.com/apache/incubator-livy/pull/249):  
> - Submission of Batch jobs and Interactive sessions  
> - Caching Driver logs and Kubernetes Pods diagnostics  
>  
> Aux features (introduced in  

Re: Possible missing mentor(s)

2019-09-01 Thread Pat Ferrel
Seems like some action should be taken before 2 years pass, even if it is to
close the PR because it is not appropriate. Isn't it the responsibility
of the chair to guard against committer turnover stalling a PR when the
contributor is still willing? Or if a mentor is guiding the PR, they should
help it get unstalled if the contributor is still willing to make changes.

The point (2-year-old PRs) seems well taken. The question should be: what
can be done about this?

For what it’s worth, we are just starting to use Livy and wish it was part
of Spark. We would like to treat Spark as a “microservice” as a compute
engine. The Spark team seems to want to make Spark integral to the
architecture of ALL applications that use it. Very odd from our point of
view.

So to integrate Livy we deeply hope it doesn’t fall into disrepair and are
willing to help when we run into something.


From: Sid Anand  
Reply: dev@livy.incubator.apache.org 

Date: September 1, 2019 at 11:19:00 AM
To: dev@livy.incubator.apache.org 

Subject:  Re: Possible missing mentor(s)

"Second, if someone has a *good* and *large* contribution history, and
actively participates in community, we will add him without doubt. Third,
2-year old open PRs doesn't stand anything, some reviewers left the
community and PRs get staled, it is quite common, especially in large
community."

Meisam has 7 closed and 3 open PRs - of the 4 oldest open PRs in Livy (I
see 4 in 2017), 2 are his. He's ranked #10 in the contributor list --
It's not a large contribution history mostly because it takes so long to
merge and he has been consistently active for 2 years. The size of the
community doesn't seem a factor here with <200 closed PRs and <50
contributors.

How are you prioritizing PR merges if you think having a 2-year-old open PR
is okay and you don't have a ton of open PRs?
-s

On Sun, Sep 1, 2019 at 2:25 AM Saisai Shao  wrote:

> First, we're scaling the PR review, but we only have a few active committers,
> so merges may not be fast.
> Second, if someone has a *good* and *large* contribution history, and
> actively participates in the community, we will add him without doubt.
> Third, 2-year-old open PRs don't mean anything; some reviewers left the
> community and PRs got stale. It is quite common, especially in a large
> community.
>
> Sid Anand  wrote on Sun, Sep 1, 2019 at 4:46 PM:
>
> > Apache projects promote contributors to committers based on
contributions
> > made, not on an expectation of future activity. That's the Apache way
per
> > my understanding. Over time, folks become inactive and busy -- life
> > happens, I get it. May I ask what are you folks doing to scale PR
review
> > and merging? Are you adding new committers? Do you feel that 2-year old
> > open PRs is where you wish to be and is the right way to grow a
> community?
> >
> > On Sun, Sep 1, 2019 at 1:46 AM Sid Anand  wrote:
> >
> > > Apache projects promote contributors to committers based on
> contributions
> > > made, not on an expectation of future activity. That's the Apache way
> per
> > > my understanding. Over time, folks become inactive and busy -- life
> > > happens, I get it. May I ask what are you folks doing to scale PR
> review
> > > and merging? Are you adding new committers? Do you feel that 2-year
> old
> > > open PRs is where you wish to be and is the right way to grow a
> > community?
> > >
> > > -s
> > >
> > > On Sun, Sep 1, 2019 at 12:59 AM Saisai Shao 
> > > wrote:
> > >
> > >> It's unfair to say there's underlying bias. Livy is a small
> > >> project; the contributor diversity may not be as rich as in a popular
> > >> project like Spark, but it is not fair to say that contributions are
> > >> limited to only some people and that the project is therefore biased.
> > >> There are many small Apache projects which have only a few contributors;
> > >> can we say those projects are biased? Also, over the years committers
> > >> have joined and left the community, and it is hard to track every
> > >> contribution in time, as we're not full-time Livy open source
> > >> contributors. I also have several PRs left unreviewed for years. It's
> > >> quite common even for large projects like Spark and Hadoop that there
> > >> are many un-merged PRs left for several years. It's unfair to say the
> > >> project is biased or unhealthy because of some un-merged PRs.
> > >>
> > >> The community is small but free and open. I would deny that the
> > >> community is unhealthy, especially biased; this is an irresponsible and
> > >> subjective word.
> > >>
> > >> Sid Anand  wrote on Sun, Sep 1, 2019 at 4:20 AM:
> > >>
> > >> > Folks!
> > >> > We've (several devs, myself included) contacted the livy dev list
> and
> > >> the
> > >> > owners DL several times. Our PRs stagnated over a few years. Livy
> is a
> > >> > central component in PayPal's Data Infra (Our data footprint is
80+
> > PB).
> > >> > The project seems pretty unhealthy. After a few years, this dev
> moved
> > on
> > >> > and the state of our PR may be harder to define, with both
absentee
> > >> > 

Re: k8s orchestrating Spark service

2019-07-03 Thread Pat Ferrel
Thanks for the in-depth explanation.

These methods would require us to architect our Server around Spark, but it is
actually designed to be independent of the ML implementation. SparkML is
an important algorithm source, to be sure, but so are TensorFlow and non-Spark
Python libraries, among others. So Spark stays at arm's length in a
microservices pattern. Doing this with access to job status and management
is why Livy and the (Spark) Job Server exist. To us the ideal is treating
Spark like a compute server that will respond to a service API for job
submittal and control.

None of the above is solved by k8s Spark. Further, we find that the Spark
programmatic API does not support deploy mode = "cluster". This means we
have to take a simple part of our code and partition it into new jars only
to get spark-submit to work. To help with job tracking and management when
you are not using the programmatic API, we look to Livy. I guess if you ask
our opinion of spark-submit, we'd (selfishly) say it hides architectural
issues that should be solved in the Spark programmatic API, but the
popularity of spark-submit is causing the community to avoid these or just
not see or care about them. I guess we'll see if Spark behind Livy gives us
what we want.

Maybe this is unusual but we see Spark as a service, not an integral
platform. We also see Kubernetes as very important but optional for HA or
when you want to scale horizontally, basically when vertical is not
sufficient. Vertical scaling is more cost effective so Docker Compose is a
nice solution for simpler, Kubernetes-less deployments.

So if we are agnostic about the job master, and communicate through Livy,
we are back to orchestrating services with Docker and Kubernetes. If k8s
becomes a super-duper job master, great! But it doesn't solve today's
question.
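Since Livy keeps coming up as the job-submission layer here, a minimal sketch of what submitting a batch job to Livy's REST API looks like from a client (host, jar path, and class name are hypothetical; requires Java 11+ for the HTTP client):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Hypothetical Livy endpoint and job jar. POST /batches asks Livy to run
// spark-submit on our behalf and returns a batch id that can be polled for
// state and logs.
val livyUrl = "http://livy.example.com:8998/batches"
val payload =
  """{
    |  "file": "hdfs:///jobs/my-recommender-assembly.jar",
    |  "className": "com.example.TrainJob",
    |  "args": ["--months", "3"],
    |  "driverMemory": "4g",
    |  "executorMemory": "8g"
    |}""".stripMargin

val client = HttpClient.newHttpClient()
val request = HttpRequest.newBuilder(URI.create(livyUrl))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(payload))
  .build()

val response = client.send(request, HttpResponse.BodyHandlers.ofString())
println(response.body()) // e.g. {"id":1,"state":"starting",...}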


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 5:14:05 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

> We’d like to deploy Spark Workers/Executors and Master (whatever master
is easiest to talk about since we really don’t care) in pods as we do with
the other services we use. Replace Spark Master with k8s if you insist. How
do the executors get deployed?



When running Spark against Kubernetes natively, the Spark library handles
requesting executors from the API server. So presumably one would only need
to know how to start the driver in the cluster – maybe spark-operator,
spark-submit, or just starting the pod and making a Spark context in client
mode with the right parameters. From there, the Spark scheduler code knows
how to interface with the API server and request executor pods according to
the resource requests configured in the app.



> We have a machine Learning Server. It submits various jobs through the
Spark Scala API. The Server is run in a pod deployed from a chart by k8s.
It later uses the Spark API to submit jobs. I guess we find spark-submit to
be a roadblock to our use of Spark and the k8s support is fine but how do
you run our Driver and Executors considering that the Driver is part of the
Server process?



It depends on how the server runs the jobs:

   - If each job is meant to be a separate forked driver pod / process: The
   ML server code can use the SparkLauncher API
   
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html>
   and configure the Spark driver through that API. Set the master to point to
   the Kubernetes API server and set the parameters for credentials according
   to your setup. SparkLauncher is a thin layer on top of spark-submit; a
   Spark distribution has to be packaged with the ML server image and
   SparkLauncher would point to the spark-submit script in said distribution.
   - If all jobs run inside the same driver, that being the ML server: One
   has to start the ML server with the right parameters to point to the
   Kubernetes master. Since the ML server is a driver, one has the option to
   use spark-submit or SparkLauncher to deploy the ML server itself.
   Alternatively one can use a custom script to start the ML server, then the
   ML server process has to create a SparkContext object parameterized against
   the Kubernetes server in question.



I hope this helps!



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Monday, July 1, 2019 at 5:05 PM
*To: *"user@spark.apache.org" , Matt Cheah <
mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service



We have a machine Learning Server. It submits various jobs through the
Spark Scala API. The Server is run in a pod deployed from a chart by k8s.
It later uses the Spark API to submit jobs. I guess we find spark-submit to
be a roadblock to our use of Spark and the k8s support is fine but how do
you run our Driver and Executors considering that the Driver is part of the
Server process?



Maybe we are talking past each other with some mistaken assumptions (on my
part perhaps).







From: Pat

Re: JAVA_HOME is not set

2019-07-03 Thread Pat Ferrel
Oops, should have said: "I may have missed something but I don’t recall PIO
being released by Apache as an ASF maintained container/image release
artifact."


From: Pat Ferrel  
Reply: user@predictionio.apache.org 

Date: July 3, 2019 at 11:16:43 AM
To: Wei Chen  ,
d...@predictionio.apache.org 
, user@predictionio.apache.org
 
Subject:  Re: JAVA_HOME is not set

BTW the container you use is supported by the container author, if at all.

I may have missed something but I don’t recall PIO being released by Apache
as an ASF maintained release artifact.

I wish ASF projects would publish Docker Images made for real system
integration, but IIRC PIO does not.


From: Wei Chen  
Reply: d...@predictionio.apache.org 

Date: July 2, 2019 at 5:14:38 PM
To: user@predictionio.apache.org 

Cc: d...@predictionio.apache.org 

Subject:  Re: JAVA_HOME is not set

Add these steps in your docker file.
https://vitux.com/how-to-setup-java_home-path-in-ubuntu/

Best Regards
Wei

On Wed, Jul 3, 2019 at 5:06 AM Alexey Grachev <
alexey.grac...@turbinekreuzberg.com> wrote:

> Hello,
>
>
> I installed newest pio 0.14 with docker. (ubuntu 18.04)
>
> After starting pio -> I get "JAVA_HOME is not set"
>
>
> Does anyone know where in docker config I have to setup the JAVA_HOME
> env variable?
>
>
> Thanks a lot!
>
>
> Alexey
>
>
>


Re: JAVA_HOME is not set

2019-07-03 Thread Pat Ferrel
BTW the container you use is supported by the container author, if at all.

I may have missed something but I don’t recall PIO being released by Apache
as an ASF maintained release artifact.

I wish ASF projects would publish Docker Images made for real system
integration, but IIRC PIO does not.


From: Wei Chen  
Reply: d...@predictionio.apache.org 

Date: July 2, 2019 at 5:14:38 PM
To: user@predictionio.apache.org 

Cc: d...@predictionio.apache.org 

Subject:  Re: JAVA_HOME is not set

Add these steps in your docker file.
https://vitux.com/how-to-setup-java_home-path-in-ubuntu/

Best Regards
Wei

On Wed, Jul 3, 2019 at 5:06 AM Alexey Grachev <
alexey.grac...@turbinekreuzberg.com> wrote:

> Hello,
>
>
> I installed newest pio 0.14 with docker. (ubuntu 18.04)
>
> After starting pio -> I get "JAVA_HOME is not set"
>
>
> Does anyone know where in docker config I have to setup the JAVA_HOME
> env variable?
>
>
> Thanks a lot!
>
>
> Alexey
>
>
>


Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
We have a machine Learning Server. It submits various jobs through the
Spark Scala API. The Server is run in a pod deployed from a chart by k8s.
It later uses the Spark API to submit jobs. I guess we find spark-submit to
be a roadblock to our use of Spark and the k8s support is fine but how do
you run our Driver and Executors considering that the Driver is part of the
Server process?

Maybe we are talking past each other with some mistaken assumptions (on my
part perhaps).
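For what it's worth, the native Kubernetes route discussed in this thread reduces to something like the following when the server process itself is the driver (client mode against the Kubernetes API server); a rough sketch with hypothetical master URL, namespace, image, and driver host:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Rough sketch: the long-lived server process acts as the Spark driver in
// client mode and asks the Kubernetes API server for executor pods.
val conf = new SparkConf()
  .setAppName("ml-server-job")
  .setMaster("k8s://https://kubernetes.default.svc")
  .set("spark.submit.deployMode", "client")
  .set("spark.kubernetes.namespace", "spark")
  .set("spark.kubernetes.container.image", "example/spark:2.4.0")
  .set("spark.executor.instances", "3")
  // Executors must be able to reach the driver, typically via a headless
  // service that points at the server pod.
  .set("spark.driver.host", "ml-server.spark.svc.cluster.local")

val spark = SparkSession.builder().config(conf).getOrCreate()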



From: Pat Ferrel  
Reply: Pat Ferrel  
Date: July 1, 2019 at 4:57:20 PM
To: user@spark.apache.org  , Matt
Cheah  
Subject:  Re: k8s orchestrating Spark service

k8s as master would be nice but doesn’t solve the problem of running the
full cluster and is an orthogonal issue.

We’d like to deploy Spark Workers/Executors and Master (whatever master is
easiest to talk about since we really don’t care) in pods as we do with the
other services we use. Replace Spark Master with k8s if you insist. How do
the executors get deployed?

We have our own containers that almost work for 2.3.3. We have used this
before with older Spark so we are reasonably sure it makes sense. We just
wonder if our own image builds and charts are the best starting point.

Does anyone have something they like?


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

Sorry, I don’t quite follow – why use the Spark standalone cluster as an
in-between layer when one can just deploy the Spark application directly
inside the Helm chart? I’m curious as to what the use case is, since I’m
wondering if there’s something we can improve with respect to the native
integration with Kubernetes here. Deploying on Spark standalone mode in
Kubernetes is, to my understanding, meant to be superseded by the native
integration introduced in Spark 2.4.



*From: *Pat Ferrel 
*Date: *Monday, July 1, 2019 at 4:40 PM
*To: *"user@spark.apache.org" , Matt Cheah <
mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service



Thanks Matt,



Actually I can’t use spark-submit. We submit the Driver programmatically
through the API. But this is not the issue and using k8s as the master is
also not the issue though you may be right about it being easier, it
doesn’t quite get to the heart.



We want to orchestrate a bunch of services including Spark. The rest work,
we are asking if anyone has seen a good starting point for adding Spark as
a k8s managed service.




From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service



I would recommend looking into Spark’s native support for running on
Kubernetes. One can just start the application against Kubernetes directly
using spark-submit in cluster mode or starting the Spark context with the
right parameters in client mode. See
https://spark.apache.org/docs/latest/running-on-kubernetes.html



I would think that building Helm around this architecture of running Spark
applications would be easier than running a Spark standalone cluster. But
admittedly I’m not very familiar with the Helm technology – we just use
spark-submit.



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Sunday, June 30, 2019 at 12:55 PM
*To: *"user@spark.apache.org" 
*Subject: *k8s orchestrating Spark service



We're trying to setup a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.



Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.



So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This not a request to use k8s to orchestrate Spark Jobs, but the service
cluster itself.



Thanks


Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
k8s as master would be nice but doesn’t solve the problem of running the
full cluster and is an orthogonal issue.

We’d like to deploy Spark Workers/Executors and Master (whatever master is
easiest to talk about since we really don’t care) in pods as we do with the
other services we use. Replace Spark Master with k8s if you insist. How do
the executors get deployed?

We have our own containers that almost work for 2.3.3. We have used this
before with older Spark so we are reasonably sure it makes sense. We just
wonder if our own image builds and charts are the best starting point.

Does anyone have something they like?


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

Sorry, I don’t quite follow – why use the Spark standalone cluster as an
in-between layer when one can just deploy the Spark application directly
inside the Helm chart? I’m curious as to what the use case is, since I’m
wondering if there’s something we can improve with respect to the native
integration with Kubernetes here. Deploying on Spark standalone mode in
Kubernetes is, to my understanding, meant to be superseded by the native
integration introduced in Spark 2.4.



*From: *Pat Ferrel 
*Date: *Monday, July 1, 2019 at 4:40 PM
*To: *"user@spark.apache.org" , Matt Cheah <
mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service



Thanks Matt,



Actually I can’t use spark-submit. We submit the Driver programmatically
through the API. But this is not the issue and using k8s as the master is
also not the issue though you may be right about it being easier, it
doesn’t quite get to the heart.



We want to orchestrate a bunch of services including Spark. The rest work,
we are asking if anyone has seen a good starting point for adding Spark as
a k8s managed service.




From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service



I would recommend looking into Spark’s native support for running on
Kubernetes. One can just start the application against Kubernetes directly
using spark-submit in cluster mode or starting the Spark context with the
right parameters in client mode. See
https://spark.apache.org/docs/latest/running-on-kubernetes.html



I would think that building Helm around this architecture of running Spark
applications would be easier than running a Spark standalone cluster. But
admittedly I’m not very familiar with the Helm technology – we just use
spark-submit.



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Sunday, June 30, 2019 at 12:55 PM
*To: *"user@spark.apache.org" 
*Subject: *k8s orchestrating Spark service



We're trying to setup a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.



Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.



So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This not a request to use k8s to orchestrate Spark Jobs, but the service
cluster itself.



Thanks


Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
Thanks Matt,

Actually I can’t use spark-submit. We submit the Driver programmatically
through the API. But this is not the issue and using k8s as the master is
also not the issue though you may be right about it being easier, it
doesn’t quite get to the heart.

We want to orchestrate a bunch of services including Spark. The rest work,
we are asking if anyone has seen a good starting point for adding Spark as
a k8s managed service.


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

I would recommend looking into Spark’s native support for running on
Kubernetes. One can just start the application against Kubernetes directly
using spark-submit in cluster mode or starting the Spark context with the
right parameters in client mode. See
https://spark.apache.org/docs/latest/running-on-kubernetes.html



I would think that building Helm around this architecture of running Spark
applications would be easier than running a Spark standalone cluster. But
admittedly I’m not very familiar with the Helm technology – we just use
spark-submit.



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Sunday, June 30, 2019 at 12:55 PM
*To: *"user@spark.apache.org" 
*Subject: *k8s orchestrating Spark service



We're trying to setup a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.



Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.



So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This not a request to use k8s to orchestrate Spark Jobs, but the service
cluster itself.



Thanks


k8s orchestrating Spark service

2019-06-30 Thread Pat Ferrel
We're trying to set up a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.

Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.

So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This is not a request to use k8s to orchestrate Spark jobs, but the service
cluster itself.

Thanks


Re: run new spark version on old spark cluster ?

2019-05-20 Thread Pat Ferrel
It is always dangerous to run a NEWER version of code on an OLDER cluster.
The danger increases with the semver change, and this one is not just a
build number: 2.4 is considered to be a fairly major change from 2.3.
Not much else can be said.


From: Nicolas Paris  
Reply: user@spark.apache.org  
Date: May 20, 2019 at 11:02:49 AM
To: user@spark.apache.org  
Subject:  Re: run new spark version on old spark cluster ?

> you will need the spark version you intend to launch with on the machine
you
> launch from and point to the correct spark-submit

does this mean to install a second spark version (2.4) on the cluster ?

thanks

On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> yarn can happily run multiple spark versions side-by-side
> you will need the spark version you intend to launch with on the machine
you
> launch from and point to the correct spark-submit
>
> On Mon, May 20, 2019 at 1:50 PM Nicolas Paris 
wrote:
>
> Hi
>
> I am wondering whether that's feasible to:
> - build a spark application (with sbt/maven) based on spark2.4
> - deploy that jar on yarn on a spark2.3 based installation
>
> thanks by advance,
>
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Fwd: Spark Architecture, Drivers, & Executors

2019-05-17 Thread Pat Ferrel
In order to create an application that executes code on Spark we have a
long-lived process. It periodically runs jobs programmatically on a Spark
cluster, meaning it does not use spark-submit. The jobs it executes have
varying memory requirements, so we want to have the Spark Driver run in
the cluster.

This kind of architecture does not work very well with Spark as we
understand it. The issue is that there is no way to run in
deployMode=cluster: this setting is ignored when launching jobs
programmatically (why is it not an exception?). This in turn means that our
launching application needs to run on a machine that is big enough to
run the worst-case Spark Driver. This is completely impractical due to our
use case (a generic, always-on Machine Learning Server).

What we would rather do is have the Scala closure that has access to the
Spark Context be treated as the Spark Driver and run in the cluster. There
seems to be no way to do this with off-the-shelf Spark.

This seems like a very common use case but maybe we are too close to it. We
are aware of the Job Server and Apache Livy, which seem to give us what we
need.

Are these the best solutions? Is there a way to do what we want without
spark-submit? Have others here solved this in some other way?
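One workaround we're aware of, sketched below with hypothetical paths and class names, is the SparkLauncher API: it wraps spark-submit in a child process, so deploy mode "cluster" is honored and the driver runs on the cluster rather than inside the long-lived server.

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Minimal sketch: launch the driver on the cluster from a long-lived server
// process. Spark home, jar path, class name, and master URL are hypothetical.
val handle = new SparkLauncher()
  .setSparkHome("/opt/spark")                        // needs a local Spark distribution
  .setAppResource("hdfs:///jobs/train-assembly.jar")
  .setMainClass("com.example.TrainJob")
  .setMaster("spark://spark-master:7077")
  .setDeployMode("cluster")                          // driver runs on the cluster, not in this JVM
  .setConf("spark.driver.memory", "8g")
  .startApplication(new SparkAppHandle.Listener {
    def stateChanged(h: SparkAppHandle): Unit = println(s"state: ${h.getState}")
    def infoChanged(h: SparkAppHandle): Unit = ()
  })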


Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Pat Ferrel
Streams have no end until watermarked or closed. Joins need bounded
datasets, et voila. Something tells me you should consider the streaming
nature of your data and whether your joins need to use increments/snippets
of infinite streams or to re-join the entire contents of the streams
accumulated at checkpoints.
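Not an answer to the nested-column case, but for reference, the usual shape of a watermarked stream-stream join keeps the event-time columns at the top level; a minimal Scala sketch with hypothetical sources and column names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder().appName("watermark-join-sketch").getOrCreate()

// Hypothetical sources standing in for the Kafka-backed entity streams from
// the question; assume each already exposes ENTITY_ID and LAST_MODIFICATION.
val entity1DF = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "ENTITY_ID")
  .withColumnRenamed("timestamp", "LAST_MODIFICATION")
val entity2DF = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "ENTITY_ID")
  .withColumnRenamed("timestamp", "LAST_MODIFICATION")

// Keep the event-time column at the top level so it can serve as both the
// watermark and part of the join condition.
val e1 = entity1DF
  .withColumn("e1_ts", col("LAST_MODIFICATION"))
  .withWatermark("e1_ts", "10 minutes")
  .alias("e1")
val e2 = entity2DF
  .withColumn("e2_ts", col("LAST_MODIFICATION"))
  .withWatermark("e2_ts", "10 minutes")
  .alias("e2")

// Watermarks plus a time-range condition bound how much state Spark keeps.
val joined = e1.join(
  e2,
  expr("e1.ENTITY_ID = e2.ENTITY_ID AND " +
       "e2_ts >= e1_ts - INTERVAL 1 HOUR AND e2_ts <= e1_ts + INTERVAL 1 HOUR"),
  "leftOuter")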


From: Joe Ammann  
Reply: Joe Ammann  
Date: May 6, 2019 at 6:45:13 AM
To: user@spark.apache.org  
Subject:  Spark structured streaming watermarks on nested attributes

Hi all

I'm pretty new to Spark and implementing my first non-trivial structured
streaming job with outer joins. My environment is a Hortonworks HDP 3.1
cluster with Spark 2.3.2, working with Python.

I understood that I need to provide watermarks and join conditions for left
outer joins to work. All my incoming Kafka streams have an attribute
"LAST_MODIFICATION" which is well suited to indicate the event time, so I
chose that for watermarking. Since I'm joining from multiple topics where
the incoming messages have common attributes, I though I'd prefix/nest all
incoming messages. Something like

entity1DF.select(struct("*").alias("entity1")).withWatermark("entity1.LAST_MODIFICATION")

entity2DF.select(struct("*").alias("entity2")).withWatermark("entity2.LAST_MODIFICATION")


Now when I try to join such 2 streams, it would fail and tell me that I
need to use watermarks

When I leave the watermarking attribute "at the top level", everything
works as expected, e.g.

entity1DF.select(struct("*").alias("entity1"),
col("LAST_MODIFICATION").alias("entity1_LAST_MODIFICATION")).withWatermark("entity1_LAST_MODIFICATION")


Before I hunt this down any further, is this kind of a known limitation? Or
am I doing something fundamentally wrong?

-- 
CU, Joe

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Pat Ferrel
@Riccardo

Spark does not do the DL learning part of the pipeline (afaik) so it is
limited to data ingestion and transforms (ETL). It therefore is optional
and other ETL options might be better for you.

Most of the technologies @Gourav mentions have their own scaling based on
their own compute engines specialized for their DL implementations, so be
aware that Spark scaling has nothing to do with scaling most of the DL
engines; they have their own solutions.

From: Gourav Sengupta 

Reply: Gourav Sengupta 

Date: May 4, 2019 at 10:24:29 AM
To: Riccardo Ferrari  
Cc: User  
Subject:  Re: Deep Learning with Spark, what is your experience?

Try using MxNet and Horovod directly as well (I think that MXNet is worth a
try as well):
1.
https://medium.com/apache-mxnet/distributed-training-using-apache-mxnet-with-horovod-44f98bf0e7b7
2.
https://docs.nvidia.com/deeplearning/dgx/mxnet-release-notes/rel_19-01.html
3. https://aws.amazon.com/mxnet/
4.
https://aws.amazon.com/blogs/machine-learning/aws-deep-learning-amis-now-include-horovod-for-faster-multi-gpu-tensorflow-training-on-amazon-ec2-p3-instances/


Ofcourse Tensorflow is backed by Google's advertisement team as well
https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/


Regards,




On Sat, May 4, 2019 at 10:59 AM Riccardo Ferrari  wrote:

> Hi list,
>
> I am trying to understand if it makes sense to leverage Spark as an enabling
> platform for Deep Learning.
>
> My open questions to you are:
>
>- Do you use Apache Spark in your DL pipelines?
>- How do you use Spark for DL? Is it just a stand-alone stage in the
>workflow (i.e. a data preparation script) or is it more integrated?
>
> I see a major advantage in leveraging Spark as a unified entry point;
> for example, you can easily abstract data sources and leverage existing
> team skills for data pre-processing and training. On the flip side you may
> hit some limitations, including supported versions and so on.
> What is your experience?
>
> Thanks!
>


Livy with Standalone Spark Master

2019-04-20 Thread Pat Ferrel
Does Livy work with a Standalone Spark Master?


Re: new install help

2019-04-15 Thread Pat Ferrel
Most people running on a Windows machine use a VM running Linux. You will
run into constant issues if you go down another road with something like
cygwin, so avoid the headache.


From: Steve Pruitt  
Reply: user@predictionio.apache.org 

Date: April 15, 2019 at 10:59:09 AM
To: user@predictionio.apache.org 

Subject:  new install help

I installed on a Windows 10 box.  A couple of questions and then a problem
I have.



I downloaded the binary distribution.

I already had Spark installed, so I changed pio-env.sh to point to my Spark.

I downloaded and installed Postgres.  I downloaded the jdbc driver and put
it in the PredictionIO-0.14.0\lib folder.



My questions are:

Reading the PIO install directions I cannot tell if ElasticSearch and HBase
are optional.  The pio-env.sh file has references to them commented out and
the PIO install page makes mention of skipping them if not using them.  So,
I didn’t install them.



When I tried executing PredictionIO-0.14.0\bin\pio eventserver & command
from the command line, I got this error

'PredictionIO-0.14.0\bin\pio' is not recognized as an internal or external
command, operable program or batch file.



Oops.  I think my assumption PIO runs on Windows is bad.  I want to confirm
it’s not something I overlooked.



-S


Why not a Top Level Project?

2019-04-08 Thread Pat Ferrel
To slightly over simplify, all it takes to be a TLP for Apache is:
1) clear community support
2) a couple Apache members to sponsor (Incubator members help)
3) demonstrated processes that follow the Apache way
4) the will of committers and PMC to move to TLP

What is missing in Livy?

I am starting to use Livy but, like anyone who sees the “incubator” label,
will be overly cautious. There is a clear need for this project beyond the use
cases mentioned. For instance we have a Machine Learning Server that tries
to be compute engine neutral but practically speaking uses Spark and HDFS
for several algorithms. We would have a hard time scaling a service that
runs the Spark Driver in the server process. The solution may well be Livy.

Here’s hoping Livy becomes a TLP

- Pat


Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
Thanks, are you referring to
https://github.com/spark-jobserver/spark-jobserver or the undocumented REST
job server included in Spark?


From: Jason Nerothin  
Reply: Jason Nerothin  
Date: March 28, 2019 at 2:53:05 PM
To: Pat Ferrel  
Cc: Felix Cheung 
, Marcelo
Vanzin  , user
 
Subject:  Re: spark.submit.deployMode: cluster

Check out the Spark Jobs API... it sits behind a REST service...


On Thu, Mar 28, 2019 at 12:29 Pat Ferrel  wrote:

> ;-)
>
> Great idea. Can you suggest a project?
>
> Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
> launches trivially in test apps since most uses are as a lib.
>
>
> From: Felix Cheung  
> Reply: Felix Cheung 
> 
> Date: March 28, 2019 at 9:42:31 AM
> To: Pat Ferrel  , Marcelo
> Vanzin  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If anyone wants to improve docs please create a PR.
>
> lol
>
>
> But seriously you might want to explore other projects that manage job
> submission on top of spark instead of rolling your own with spark-submit.
>
>
> --
> *From:* Pat Ferrel 
> *Sent:* Tuesday, March 26, 2019 2:38 PM
> *To:* Marcelo Vanzin
> *Cc:* user
> *Subject:* Re: spark.submit.deployMode: cluster
>
> Ahh, thank you indeed!
>
> It would have saved us a lot of time if this had been documented. I know,
> OSS so contributions are welcome… I can also imagine your next comment; “If
> anyone wants to improve docs see the Apache contribution rules and create a
> PR.” or something like that.
>
> BTW the code where the context is known and can be used is what I’d call a
> Driver and since all code is copied to nodes and is known in jars, it was
> not obvious to us that this rule existed but it does make sense.
>
> We will need to refactor our code to use spark-submit it appears.
>
> Thanks again.
>
>
> From: Marcelo Vanzin  
> Reply: Marcelo Vanzin  
> Date: March 26, 2019 at 1:59:36 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If you're not using spark-submit, then that option does nothing.
>
> If by "context creation API" you mean "new SparkContext()" or an
> equivalent, then you're explicitly creating the driver inside your
> application.
>
> On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
> >
> > I have a server that starts a Spark job using the context creation API.
> It DOES NOT use spark-submit.
> >
> > I set spark.submit.deployMode = “cluster”
> >
> > In the GUI I see 2 workers with 2 executors. The link for running
> application “name” goes back to my server, the machine that launched the
> job.
> >
> > This is spark.submit.deployMode = “client” according to the docs. I set
> the Driver to run on the cluster but it runs on the client, ignoring the
> spark.submit.deployMode.
> >
> > Is this as expected? It is documented nowhere I can find.
> >
>
>
> --
> Marcelo
>
> --
Thanks,
Jason


Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
;-)

Great idea. Can you suggest a project?

Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
launches trivially in test apps since most uses are as a lib.


From: Felix Cheung  
Reply: Felix Cheung  
Date: March 28, 2019 at 9:42:31 AM
To: Pat Ferrel  , Marcelo
Vanzin  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If anyone wants to improve docs please create a PR.

lol


But seriously you might want to explore other projects that manage job
submission on top of spark instead of rolling your own with spark-submit.


--
*From:* Pat Ferrel 
*Sent:* Tuesday, March 26, 2019 2:38 PM
*To:* Marcelo Vanzin
*Cc:* user
*Subject:* Re: spark.submit.deployMode: cluster

Ahh, thank you indeed!

It would have saved us a lot of time if this had been documented. I know,
OSS so contributions are welcome… I can also imagine your next comment; “If
anyone wants to improve docs see the Apache contribution rules and create a
PR.” or something like that.

BTW the code where the context is known and can be used is what I’d call a
Driver and since all code is copied to nodes and is known in jars, it was
not obvious to us that this rule existed but it does make sense.

We will need to refactor our code to use spark-submit it appears.

Thanks again.


From: Marcelo Vanzin  
Reply: Marcelo Vanzin  
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
>
> I have a server that starts a Spark job using the context creation API.
It DOES NOT use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set
the Driver to run on the cluster but it runs on the client, ignoring the
spark.submit.deployMode.
>
> Is this as expected? It is documented nowhere I can find.
>


--
Marcelo


Re: Where does the Driver run?

2019-03-28 Thread Pat Ferrel
Thanks for the pointers. We’ll investigate.

We have been told that the “Driver” is run in the launching JVM because
deployMode = cluster is ignored if spark-submit is not used to launch.

You are saying that there is a loophole and if you use one of these client
classes there is a way to run part of the app on the cluster, and you have
seen this for Yarn?

To explain more, we create a SparkConf, and then a SparkContext, which we
pass around implicitly to functions that I would define as the Spark
Driver. It seems that if you do not use spark-submit, the entire launching
app/JVM process is considered the Driver AND is always run in client mode.

I hope your loophole pays off or we will have to do a major refactoring.


From: Jianneng Li  
Reply: Jianneng Li  
Date: March 28, 2019 at 2:03:47 AM
To: p...@occamsmachete.com  
Cc: andrew.m...@gmail.com  ,
user@spark.apache.org  ,
ak...@hacked.work  
Subject:  Re: Where does the Driver run?

Hi Pat,

The driver runs in the same JVM as SparkContext. You didn't go into detail
about how you "launch" the job (i.e. how the SparkContext is created), so
it's hard for me to guess where the driver is.

For reference, we've had success launching Spark programmatically to YARN
in cluster mode by creating a SparkConf like you did and using it to call
this class:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

I haven't tried this myself, but for standalone mode you might be able to
use this:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala

Lastly, you can always check where Spark processes run by executing ps on
the machine, i.e. `ps aux | grep java`.

Best,

Jianneng
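
One more programmatic option, separate from the classes Jianneng links to, is Spark's public launcher API (org.apache.spark.launcher.SparkLauncher), which drives spark-submit for you so that deployMode=cluster is actually honored. A sketch only; the jar path, main class, master URL, and memory settings are placeholders:

```
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launch through spark-submit programmatically; SPARK_HOME must point at a
// Spark distribution (or call setSparkHome explicitly).
object ClusterLaunch {
  def main(args: Array[String]): Unit = {
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app-assembly.jar") // placeholder jar
      .setMainClass("com.example.MyJob")              // placeholder driver class
      .setMaster("spark://master-address:7077")       // or "yarn"
      .setDeployMode("cluster")
      .setConf("spark.driver.memory", "4g")
      .setConf("spark.executor.memory", "8g")
      .startApplication()                             // async, returns a handle

    // Poll the handle instead of blocking a server thread indefinitely.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Final state: ${handle.getState}")
  }
}
```

The server process then only hosts the launcher handle; the driver itself runs on the cluster.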



*From:* Pat Ferrel 
*Date:* Monday, March 25, 2019 at 12:58 PM
*To:* Andrew Melo 
*Cc:* user , Akhil Das 
*Subject:* Re: Where does the Driver run?



I’m beginning to agree with you and find it rather surprising that this is
mentioned nowhere explicitly (maybe I missed it?). It is possible to serialize
code to be executed in executors to various nodes. It also seems possible
to serialize the “driver” bits of code although I’m not sure how the
boundary would be defined. All code is in the jars we pass to Spark so
until now I did not question the docs.



I see no mention of a distinction between running a driver in spark-submit
vs being programmatically launched for any of the Spark Master types:
Standalone, Yarn, Mesos, k8s.



We are building a Machine Learning Server in OSS. It has pluggable Engines
for different algorithms. Some of these use Spark so it is highly desirable
to offload driver code to the cluster since we don’t want the driver
embedded in the Server process. The Driver portion of our training workflow
could be very large indeed and so could force the server to be scaled for the
worst case.



I hope someone knows how to run “Driver” code on the cluster when our
server is launching the code. So deployMode = cluster, deploy method =
programmatic launch.




From: Andrew Melo  
Reply: Andrew Melo  
Date: March 25, 2019 at 11:40:07 AM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?



Hi Pat,



Indeed, I don't think that it's possible to use cluster mode w/o
spark-submit. All the docs I see appear to always describe needing to use
spark-submit for cluster mode -- it's not even compatible with spark-shell.
But it makes sense to me -- if you want Spark to run your application's
driver, you need to package it up and send it to the cluster manager. You
can't start spark one place and then later migrate it to the cluster. It's
also why you can't use spark-shell in cluster mode either, I think.



Cheers

Andrew



On Mon, Mar 25, 2019 at 11:22 AM Pat Ferrel  wrote:

In the GUI while the job is running the app-id link brings up logs to both
executors. The “name” link goes to 4040 of the machine that launched the
job but is not resolvable right now so the page is not shown. I’ll try the
netstat but the use of port 4040 was a good clue.



By what you say below this indicates the Driver is running on the launching
machine, the client to the Spark Cluster. This should be the case in
deployMode = client.



Can someone explain what is going on? The evidence seems to say that
deployMode = cluster *does not work* as described unless you use
spark-submit (and I’m only guessing at that).



Further; if we don’t use spark-submit we can’t use deployMode = cluster ???




From: Akhil Das  
Reply: Akhil Das  
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?



There's also a driver ui (usually available on port 4040), after running
your code, I assume you are running it on your machine, visit
localhost:4040 and you will get the driver UI.



If you think the driver is running on your master/executor nodes, login to
those machines and do a



   netstat -napt | grep -I listen



You will see the driver listening on 404x there, this won't be the case
mostly as you are not doing Spark-submit or using the deployMode=cluster.

Re: spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
Ahh, thank you indeed!

It would have saved us a lot of time if this had been documented. I know,
OSS so contributions are welcome… I can also imagine your next comment; “If
anyone wants to improve docs see the Apache contribution rules and create a
PR.” or something like that.

BTW the code where the context is known and can be used is what I’d call a
Driver and since all code is copied to nodes and is known in jars, it was
not obvious to us that this rule existed but it does make sense.

We will need to refactor our code to use spark-submit it appears.

Thanks again.


From: Marcelo Vanzin  
Reply: Marcelo Vanzin  
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
>
> I have a server that starts a Spark job using the context creation API.
It DOES NOT use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set
the Driver to run on the cluster but it runs on the client, ignoring the
spark.submit.deployMode.
>
> Is this as expected? It is documented nowhere I can find.
>


-- 
Marcelo


spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
I have a server that starts a Spark job using the context creation API. It
DOES NOT use spark-submit.

I set spark.submit.deployMode = “cluster”

In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.

This is spark.submit.deployMode = “client” according to the docs. I set the
Driver to run on the cluster but it runs on the client, *ignoring
the spark.submit.deployMode*.

Is this as expected? It is documented nowhere I can find.


Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
I’m beginning to agree with you and find it rather surprising that this is
mentioned nowhere explicitly (maybe I missed it?). It is possible to serialize
code to be executed in executors to various nodes. It also seems possible
to serialize the “driver” bits of code although I’m not sure how the
boundary would be defined. All code is in the jars we pass to Spark so
until now I did not question the docs.

I see no mention of a distinction between running a driver in spark-submit
vs being programmatically launched for any of the Spark Master types:
Standalone, Yarn, Mesos, k8s.

We are building a Machine Learning Server in OSS. It has pluggable Engines
for different algorithms. Some of these use Spark so it is highly desirable
to offload driver code to the cluster since we don’t want the driver
embedded in the Server process. The Driver portion of our training workflow
could be very large indeed and so could force the server to be scaled for the
worst case.

I hope someone knows how to run “Driver” code on the cluster when our
server is launching the code. So deployMode = cluster, deploy method =
programmatic launch.


From: Andrew Melo  
Reply: Andrew Melo  
Date: March 25, 2019 at 11:40:07 AM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

Indeed, I don't think that it's possible to use cluster mode w/o
spark-submit. All the docs I see appear to always describe needing to use
spark-submit for cluster mode -- it's not even compatible with spark-shell.
But it makes sense to me -- if you want Spark to run your application's
driver, you need to package it up and send it to the cluster manager. You
can't start spark one place and then later migrate it to the cluster. It's
also why you can't use spark-shell in cluster mode either, I think.

Cheers
Andrew

On Mon, Mar 25, 2019 at 11:22 AM Pat Ferrel  wrote:

> In the GUI while the job is running the app-id link brings up logs to both
> executors. The “name” link goes to 4040 of the machine that launched the
> job but is not resolvable right now so the page is not shown. I’ll try the
> netstat but the use of port 4040 was a good clue.
>
> By what you say below this indicates the Driver is running on the
> launching machine, the client to the Spark Cluster. This should be the case
> in deployMode = client.
>
> Can someone explain what is going on? The evidence seems to say that
> deployMode = cluster *does not work* as described unless you use
> spark-submit (and I’m only guessing at that).
>
> Further; if we don’t use spark-submit we can’t use deployMode = cluster ???
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 24, 2019 at 7:45:07 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> There's also a driver ui (usually available on port 4040), after running
> your code, I assume you are running it on your machine, visit
> localhost:4040 and you will get the driver UI.
>
> If you think the driver is running on your master/executor nodes, login to
> those machines and do a
>
>netstat -napt | grep -I listen
>
> You will see the driver listening on 404x there, this won't be the case
> mostly as you are not doing Spark-submit or using the deployMode=cluster.
>
> On Mon, 25 Mar 2019, 01:03 Pat Ferrel,  wrote:
>
>> Thanks, I have seen this many times in my research. Paraphrasing docs:
>> “in deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>>
>> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
>> with addresses that match slaves). When I look at memory usage while the
>> job runs I see virtually identical usage on the 2 Workers. This would
>> support your claim and contradict Spark docs for deployMode = cluster.
>>
>> The evidence seems to contradict the docs. I am now beginning to wonder
>> if the Driver only runs in the cluster if we use spark-submit
>>
>>
>>
>> From: Akhil Das  
>> Reply: Akhil Das  
>> Date: March 23, 2019 at 9:26:50 PM
>> To: Pat Ferrel  
>> Cc: user  
>> Subject:  Re: Where does the Driver run?
>>
>> If you are starting your "my-app" on your local machine, that's where the
>> driver is running.
>>
>> [image: image.png]
>>
>> Hope this helps.
>> <https://spark.apache.org/docs/latest/cluster-overview.html>
>>
>> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>>
>>> I have researched this for a significant amount of time and find answers
>>> that seem to be for a slightly different question than mine.
>>>
>>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>>> http://master-address:8080”, there are 2 idle workers, as configured.
>>>
>>> I have a S

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
In the GUI while the job is running the app-id link brings up logs to both
executors. The “name” link goes to 4040 of the machine that launched the
job but is not resolvable right now so the page is not shown. I’ll try the
netstat but the use of port 4040 was a good clue.

By what you say below this indicates the Driver is running on the launching
machine, the client to the Spark Cluster. This should be the case in
deployMode = client.

Can someone explain what is going on? The evidence seems to say that
deployMode = cluster *does not work* as described unless you use
spark-submit (and I’m only guessing at that).

Further; if we don’t use spark-submit we can’t use deployMode = cluster ???


From: Akhil Das  
Reply: Akhil Das  
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?

There's also a driver ui (usually available on port 4040), after running
your code, I assume you are running it on your machine, visit
localhost:4040 and you will get the driver UI.

If you think the driver is running on your master/executor nodes, login to
those machines and do a

   netstat -napt | grep -I listen

You will see the driver listening on 404x there, this won't be the case
mostly as you are not doing Spark-submit or using the deployMode=cluster.

On Mon, 25 Mar 2019, 01:03 Pat Ferrel,  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations forks from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>


Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
2 Slaves, one of which is also Master.

Node 1 & 2 are slaves. Node 1 is where I run start-all.sh.

The machines both have 60g of free memory (leaving about 4g for the master
process on Node 1). The only constraint to the Driver and Executors is
spark.driver.memory = spark.executor.memory = 60g

BTW I would expect this to create one Executor, one Driver, and the Master
on 2 Workers.




From: Andrew Melo  
Reply: Andrew Melo  
Date: March 24, 2019 at 12:46:35 PM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>

Where/how are you starting "./sbin/start-master.sh"?

Cheers
Andrew


>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations forks from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>




Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
2 Slaves, one of which is also Master.

Node 1 & 2 are slaves. Node 1 is where I run start-all.sh.

The machines both have 60g of free memory (leaving about 4g for the master
process on Node 1). The only constraint to the Driver and Executors is
spark.driver.memory = spark.executor.memory = 60g


From: Andrew Melo  
Reply: Andrew Melo  
Date: March 24, 2019 at 12:46:35 PM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>

Where/how are you starting "./sbin/start-master.sh"?

Cheers
Andrew


>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations forks from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>




Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
Thanks, I have seen this many times in my research. Paraphrasing docs: “in
deployMode ‘cluster' the Driver runs on a Worker in the cluster”

When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
with addresses that match slaves). When I look at memory usage while the
job runs I see virtually identical usage on the 2 Workers. This would
support your claim and contradict Spark docs for deployMode = cluster.

The evidence seems to contradict the docs. I am now beginning to wonder if
the Driver only runs in the cluster if we use spark-submit



From: Akhil Das  
Reply: Akhil Das  
Date: March 23, 2019 at 9:26:50 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?

If you are starting your "my-app" on your local machine, that's where the
driver is running.

[image: image.png]

Hope this helps.
<https://spark.apache.org/docs/latest/cluster-overview.html>

On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on “
> http://master-address:8080”, there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*, I start the Job programmatically and
> this is where many explanations forks from my question.
>
> In "my-app" I create a new SparkConf, with the following code (slightly
> abbreviated):
>
>   conf.setAppName(“my-job")
>   conf.setMaster(“spark://master-address:7077”)
>   conf.set(“deployMode”, “cluster”)
>   // other settings like driver and executor memory requests
>   // the driver and executor memory requests are for all mem on the
> slaves, more than
>   // mem available on the launching machine with “my-app"
>   val jars = listJars(“/path/to/lib")
>   conf.setJars(jars)
>   …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
> taking all cluster resources. With a Yarn cluster I would expect the
> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
> Master, where is the Driver part of the Job running?
>
> If it is running in the Master, we are in trouble because I start the
> Master on one of my 2 Workers sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need to understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside an Executor, or ???
>
> The “Driver” creates and broadcasts some large data structures so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for your help!
>


--
Cheers!




Where does the Driver run?

2019-03-23 Thread Pat Ferrel
I have researched this for a significant amount of time and find answers
that seem to be for a slightly different question than mine.

The Spark 2.3.3 cluster is running fine. I see the GUI on “
http://master-address:8080”, there are 2 idle workers, as configured.

I have a Scala application that creates a context and starts execution of a
Job. I *do not use spark-submit*, I start the Job programmatically and this
is where many explanations forks from my question.

In "my-app" I create a new SparkConf, with the following code (slightly
abbreviated):

  conf.setAppName(“my-job")
  conf.setMaster(“spark://master-address:7077”)
  conf.set(“deployMode”, “cluster”)
  // other settings like driver and executor memory requests
  // the driver and executor memory requests are for all mem on the
slaves, more than
  // mem available on the launching machine with “my-app"
  val jars = listJars(“/path/to/lib")
  conf.setJars(jars)
  …

When I launch the job I see 2 executors running on the 2 workers/slaves.
Everything seems to run fine and sometimes completes successfully. Frequent
failures are the reason for this question.

Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
taking all cluster resources. With a Yarn cluster I would expect the
“Driver" to run on/in the Yarn Master but I am using the Spark Standalone
Master, where is the Driver part of the Job running?

If it is running in the Master, we are in trouble because I start the
Master on one of my 2 Workers sharing resources with one of the Executors.
Executor mem + driver mem is > available mem on a Worker. I can change this
but need to understand where the Driver part of the Spark Job runs. Is it
in the Spark Master, or inside an Executor, or ???

The “Driver” creates and broadcasts some large data structures so the need
for an answer is more critical than with more typical tiny Drivers.

Thanks for your help!
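
One quick way to check this empirically is to read back what Spark recorded as the driver host once the context is up; if it prints the launching machine's address, the driver is running there (client-mode behavior), whatever deployMode was requested. A small diagnostic sketch reusing the settings above:

```
import org.apache.spark.{SparkConf, SparkContext}

object WhereIsTheDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("my-job")
      .setMaster("spark://master-address:7077")

    val sc = new SparkContext(conf)

    // Spark fills these in at context creation time.
    println("spark.driver.host = " + sc.getConf.get("spark.driver.host"))
    println("spark.master      = " + sc.master)

    sc.stop()
  }
}
```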


Re: Spark with Kubernetes connecting to pod ID, not address

2019-02-13 Thread Pat Ferrel
solve(SimpleNameResolver.java:55)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at 
io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at 
io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at 
io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more



From: Erik Erlandson 
Date: February 13, 2019 at 4:57:30 AM
To: Pat Ferrel 
Subject:  Re: Spark with Kubernetes connecting to pod id, not address  

Hi Pat,

I'd suggest visiting the big data slack channel, it's a more spark oriented 
forum than kube-dev:
https://kubernetes.slack.com/messages/C0ELB338T/

Tentatively, I think you may want to submit in client mode (unless you are 
initiating your application from outside the kube cluster). When in client 
mode, you need to set up a headless service for the application driver pod that 
the executors can use to talk back to the driver.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode

Cheers,
Erik


On Wed, Feb 13, 2019 at 1:55 AM Pat Ferrel  wrote:
We have a k8s deployment of several services including Apache Spark. All 
services seem to be operational. Our application connects to the Spark master 
to submit a job using the k8s DNS service for the cluster where the master is 
called spark-api so we use master=spark://spark-api:7077 and we use 
spark.submit.deployMode=cluster. We submit the job through the API not by the 
spark-submit script. 

This will run the "driver" and all "executors" on the cluster and this part 
seems to work but there is a callback to the launching code in our app from 
some Spark process. For some reason it is trying to connect to 
harness-64d97d6d6-4r4d8, which is the pod ID, not the k8s cluster IP or DNS.

How could this pod ID be getting into the system? Spark somehow seems to think 
it is the address of the service that called it. Needless to say any connection 
to the k8s pod ID fails and so does the job.

Any idea how Spark could think the pod ID is an IP address or DNS name? 

BTW if we run a small sample job with `master=local` all is well, but the same 
job executed with the above config tries to connect to the spurious pod ID.


Spark with Kubernetes connecting to pod id, not address

2019-02-12 Thread Pat Ferrel


From: Pat Ferrel 
Reply: Pat Ferrel 
Date: February 12, 2019 at 5:40:41 PM
To: user@spark.apache.org 
Subject:  Spark with Kubernetes connecting to pod id, not address  

We have a k8s deployment of several services including Apache Spark. All 
services seem to be operational. Our application connects to the Spark master 
to submit a job using the k8s DNS service for the cluster where the master is 
called `spark-api` so we use `master=spark://spark-api:7077` and we use 
`spark.submit.deployMode=cluster`. We submit the job through the API not by the 
spark-submit script. 

This will run the "driver" and all "executors" on the cluster and this part 
seems to work but there is a callback to the launching code in our app from 
some Spark process. For some reason it is trying to connect to 
`harness-64d97d6d6-4r4d8`, which is the **pod ID**, not the k8s cluster IP or 
DNS.

How could this **pod ID** be getting into the system? Spark somehow seems to 
think it is the address of the service that called it. Needless to say any 
connection to the k8s pod ID fails and so does the job.

Any idea how Spark could think the **pod ID** is an IP address or DNS name? 

BTW if we run a small sample job with `master=local` all is well, but the same 
job executed with the above config tries to connect to the spurious pod ID.

BTW2 the pod launching the Spark job has the k8s DNS name "harness-api” not 
sure if this matters

Thanks in advance
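
One workaround that has helped with this class of problem, offered here as an assumption rather than a confirmed fix for this setup, is to tell the driver explicitly which address to advertise (for example a headless Service) instead of letting it default to the pod hostname. A sketch with made-up Service name and ports:

```
import org.apache.spark.SparkConf

object HarnessSparkConf {
  // "harness-driver" is assumed to be a k8s headless Service that resolves to
  // the launching pod; the ports are pinned so the Service can expose them.
  val conf: SparkConf = new SparkConf()
    .setAppName("harness-train")
    .setMaster("spark://spark-api:7077")
    // Address executors use to call back to the driver:
    .set("spark.driver.host", "harness-driver.default.svc.cluster.local")
    // Address the driver binds to inside its own pod:
    .set("spark.driver.bindAddress", "0.0.0.0")
    .set("spark.driver.port", "35000")
    .set("spark.blockManager.port", "35100")
}
```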


Re: [NOTICE] Mandatory migration of git repositories to gitbox.apache.org

2019-01-03 Thread Pat Ferrel
+1


From: Apache Mahout 
Reply: dev@mahout.apache.org 
Date: January 3, 2019 at 11:53:02 AM
To: dev 
Subject:  Re: [NOTICE] Mandatory migration of git repositories to 
gitbox.apache.org  

  

On Thu, 3 Jan 2019 13:51:40 -0600, dev wrote:  

Cool, just making sure we needed it.  

On Thu, Jan 3, 2019 at 1:48 PM Apache Mahout mahout.sh...@gmail.com wrote:  

Trevor, yes from the Notice, a consensus is necessary: • Ensure consensus
on the move (a link to a lists.apache.org thread will suffice for us as  
evidence).  

On Thu, 3 Jan 2019 19:39:25 +, dev wrote:  

+1  

On 1/3/19, 2:31 PM, "Andrew Palumbo" ap@outlook.com wrote:  

I'd like to call a vote on moving to gitbox. Here's my +1  


Re: universal recommender version

2018-11-27 Thread Pat Ferrel
There is a tag v0.7.3 and yes it is in master:

https://github.com/actionml/universal-recommender/tree/v0.7.3


From: Marco Goldin 
Reply: user@predictionio.apache.org 
Date: November 20, 2018 at 6:56:39 AM
To: user@predictionio.apache.org , 
gyar...@griddynamics.com 
Subject:  Re: universal recommender version  

Hi George, the most recent stable release is 0.7.3, which lives in the master
branch; that's why you don't see a 0.7.3 tag.
Download master from Git and you'll be fine.
If you check the build.sbt in master you'll see specs as:

version := "0.7.3"
scalaVersion := "2.11.11"

that's the one you're looking for. 

Il giorno mar 20 nov 2018 alle ore 15:47 George Yarish 
 ha scritto:
Hi,

Can someone please advise what the most recent release version of the
universal recommender is, and where its source code is located?

According to the GitHub project https://github.com/actionml/universal-recommender
branches it is v0.8.0 (this branch looks a bit outdated),
but according to the documentation https://actionml.com/docs/ur_version_log
it is 0.7.3, which can't be found in the GitHub repo.

Thanks,
George

[jira] [Commented] (PIO-31) Move from spray to akka-http in servers

2018-09-19 Thread Pat Ferrel (JIRA)


[ 
https://issues.apache.org/jira/browse/PIO-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621051#comment-16621051
 ] 

Pat Ferrel commented on PIO-31:
---

I assume we are talking about the Event Server and the query server both, and 
dropping Spray completely. +1 to that.
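
For anyone sizing the work, a rough sketch of what a query endpoint looks like on akka-http (assuming Akka HTTP 10.2+; the route, port, and response body are placeholders, not PIO's actual code):

```
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

// Minimal akka-http stand-in for a spray-based query server.
object QueryServerSketch extends App {
  implicit val system: ActorSystem = ActorSystem("query-server")

  val route =
    path("queries.json") {
      post {
        // Real code would unmarshal the query and call the algorithm here.
        complete("""{"itemScores":[]}""")
      }
    }

  Http().newServerAt("0.0.0.0", 8000).bind(route)
}
```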

> Move from spray to akka-http in servers
> ---
>
> Key: PIO-31
> URL: https://issues.apache.org/jira/browse/PIO-31
> Project: PredictionIO
>  Issue Type: Improvement
>  Components: Core
>Reporter: Marcin Ziemiński
>Priority: Major
>  Labels: gsoc2017, newbie
>
> On account of the death of spray for http and it being reborn as akka-http we 
> should update EventServer and Dashboard. It should be fairly simple, as
> described in the following guide: 
> http://doc.akka.io/docs/akka/2.4/scala/http/migration-from-spray.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PIO train issue

2018-08-29 Thread Pat Ferrel
Assuming you are using the UR…

I don’t know how many times this has been caused by a misspelling of
eventNames in engine.json but assume you have checked that.

The fail-safe way to check is to `pio export` your data and check it
against your engine.json.

BTW `pio status` does not even try to check all services. Run `pio app
list` to see if the right appnames (dataset names) are in the EventServer,
which will check HBase, HDFS, and Elasticsearch. Then check to see you have
Spark, Elasticsearch, and HDFS running—if you have set them to run in remote
standalone mode.


From: bala vivek  
Date: August 29, 2018 at 8:43:05 AM
To: actionml-user 
, user@predictionio.apache.org
 
Subject:  PIO train issue

Hi PIO users,

I've been using the PIO 0.10 version for a long time. I recently moved the
working setup of PIO to CentOS from Ubuntu and it seems to work fine: when I
checked the PIO status, it shows all the services are up and working.
But while doing a PIO train I see a "Data set is empty" error. I have
cross-checked the HBase tables manually by scanning them, and
the records are present inside the event table. To cross-verify I tried a
curl with the access key for a particular app and the response
to it is "http200.ok", so it's confirmed that the app has
the data.
But if I run the command pio train manually it's not training the
model. The engine file has no issues, as the appname is also given correctly.
It always shows "Data set is empty". This same setup works fine with
Ubuntu 14. I haven't made any config changes to make it run on
CentOS.

Let me know what will be the reason for this issue as the data is present
in Hbase but the PIO engine fails to detect it.

Thanks
Bala


Re: Distinct recommendation from "random" backfill?

2018-08-28 Thread Pat Ferrel
The random ranking is assigned after every `pio train` so if you have not
trained in-between, they will be the same. Random is not really meant to do
what you are using it for, it is meant to surface items with no data—no
primary events. This will allow some to get real events and be recommended
for the events next time you train. It is meant to fill in when you ask for
20 recs but there are only 10 things to be recommended. Proper use of this
with frequent training will cause items with no data to be purchased and to
therefore get data. The reason rankings are assigned at train time is that
this is the only way to get all of the business rules applied to the query
as well as a random ranking. In other words, the ranking must be built into
the model with `pio train`.

If you want to recommend random items each time you query, create a list of
item ids from your catalog and return some random sample each query
yourself. This should be nearly trivial.
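
A sketch of that client-side approach (the in-memory catalog below is a placeholder; in practice the item ids would come from your own catalog store):

```
import scala.util.Random

// Return a fresh random list of item ids on every call, instead of the fixed
// random ranking that gets baked into the model at train time.
object RandomBackfill {
  // Placeholder catalog; load real item ids from wherever you keep them.
  val catalog: Seq[String] =
    Seq("8083748", "7942100", "8016271", "7731061", "8002458", "7763317")

  def randomRecs(num: Int, exclude: Set[String] = Set.empty): Seq[String] =
    Random.shuffle(catalog.filterNot(exclude)).take(num)
}

// Example: 3 random items, never recommending the item currently being viewed.
// RandomBackfill.randomRecs(3, exclude = Set("6825991"))
```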


From: Brian Chiu  
Reply: user@predictionio.apache.org 

Date: August 28, 2018 at 1:51:24 AM
To: u...@predictionio.incubator.apache.org


Subject:  Distinct recommendation from "random" backfill?

Dear pio developers and users:

I have been using predictionIO and Universal Recommender for a while.
In the Universal Recommender engine.json, there is a configuration field
`rankings`, and one of the options is random. Initially I thought it
would give each item without any related event some random recommended
items, and that each recommendation list would be different. However, it
turns out all of the random recommended item lists are the same. For
example, if both item "6825991" and item "682599" have no events
during training, the result will be

```
$ curl -H "Content-Type: application/json" -d '{ "item": "6825991" }'
http://localhost:8000/queries.json
{"itemScores":[{"item":"8083748","score":0.0},{"item":"7942100","score":0.0},{"item":"8016271","score":0.0},{"item":"7731061","score":0.0},{"item":"8002458","score":0.0},{"item":"7763317","score":0.0},{"item":"8141119","score":0.0},{"item":"8080694","score":0.0},{"item":"7994844","score":0.0},{"item":"7951667","score":0.0},{"item":"7948453","score":0.0},{"item":"8148479","score":0.0},{"item":"8113083","score":0.0},{"item":"8041124","score":0.0},{"item":"8004823","score":0.0},{"item":"8126058","score":0.0},{"item":"8093042","score":0.0},{"item":"8064036","score":0.0},{"item":"8022524","score":0.0},{"item":"7977131","score":0.0}]}

$ curl -H "Content-Type: application/json" -d '{ "item": "682599" }'
http://localhost:8000/queries.json
{"itemScores":[{"item":"8083748","score":0.0},{"item":"7942100","score":0.0},{"item":"8016271","score":0.0},{"item":"7731061","score":0.0},{"item":"8002458","score":0.0},{"item":"7763317","score":0.0},{"item":"8141119","score":0.0},{"item":"8080694","score":0.0},{"item":"7994844","score":0.0},{"item":"7951667","score":0.0},{"item":"7948453","score":0.0},{"item":"8148479","score":0.0},{"item":"8113083","score":0.0},{"item":"8041124","score":0.0},{"item":"8004823","score":0.0},{"item":"8126058","score":0.0},{"item":"8093042","score":0.0},{"item":"8064036","score":0.0},{"item":"8022524","score":0.0},{"item":"7977131","score":0.0}]}

```

But on my webpage, whenever users click on these products without
events, they will see exactly the same recommended items, which makes it
look boring. Is there any way to give each item a distinct random list?
Even if it is generated dynamically, that is OK. If you have any other
alternative, please tell me.

Thanks all developers!

Best Regards,
Brian


Why are these going to the incubator address?

2018-08-24 Thread Pat Ferrel
Is it necessary that these commits go to the incubator list? Are
notifications set up wrong?


From: git-site-r...@apache.org 

Reply: dev@predictionio.apache.org 

Date: August 24, 2018 at 10:33:34 AM
To: comm...@predictionio.incubator.apache.org


Subject:  [7/7] predictionio-site git commit: Documentation based on <

comm...@predictionio.incubator.apache.org>

apache/predictionio#fc481c9c82989e1b484ea5bfeb540bc96758bed5

Documentation based on
apache/predictionio#fc481c9c82989e1b484ea5bfeb540bc96758bed5


Project: http://git-wip-us.apache.org/repos/asf/predictionio-site/repo
Commit:
http://git-wip-us.apache.org/repos/asf/predictionio-site/commit/107116dc
Tree: http://git-wip-us.apache.org/repos/asf/predictionio-site/tree/107116dc
Diff: http://git-wip-us.apache.org/repos/asf/predictionio-site/diff/107116dc

Branch: refs/heads/asf-site
Commit: 107116dc22c9d6c5467ba0c1506c61b6a9e10e32
Parents: c17b960
Author: jenkins 
Authored: Fri Aug 24 17:33:22 2018 +
Committer: jenkins 
Committed: Fri Aug 24 17:33:22 2018 +

--
datacollection/batchimport/index.html | 4 +-
datacollection/channel/index.html | 6 +-
datacollection/eventapi/index.html | 20 +-
gallery/template-gallery/index.html | 2 +-
gallery/templates.yaml | 14 +
samples/tabs/index.html | 18 +-
sitemap.xml | 260 +--
templates/classification/quickstart/index.html | 30 +--
.../complementarypurchase/quickstart/index.html | 20 +-
.../quickstart/index.html | 60 ++---
.../quickstart/index.html | 60 ++---
templates/leadscoring/quickstart/index.html | 30 +--
templates/productranking/quickstart/index.html | 40 +--
templates/recommendation/quickstart/index.html | 30 +--
templates/similarproduct/quickstart/index.html | 40 +--
templates/vanilla/quickstart/index.html | 10 +-
16 files changed, 329 insertions(+), 315 deletions(-)
--



Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
Oh and no it does not need a new context for every query, only for the
deploy.


From: Pat Ferrel  
Date: August 7, 2018 at 10:00:49 AM
To: Ulavapalle Meghamala 

Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

The answers to your question illustrate why IMHO it is bad to have Spark
required for predictions.

Any of the MLlib ALS recommenders use Spark to predict and so run Spark
during the time they are deployed. They can use one machine or use the
entire cluster. This is one case where using the cluster slows down
predictions since part of the model may be spread across nodes. Spark is
not designed to scale in this manner for real-time queries but I believe
those are your options out of the box for the ALS recommenders.

To be both fast and scalable you would load the model entirely into memory
on one machine for fast queries then spread queries across many identical
machines to scale load. I don’t think any templates do this—it requires a
load balancer at the very least, not to mention custom deployment code that
interferes with using the same machines for training.

The UR loads the model into Elasticsearch for serving independently
scalable queries.

I always advise you keep Spark out of serving for the reasons mentioned
above.


From: Ulavapalle Meghamala 

Date: August 7, 2018 at 9:27:46 AM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

Thanks Pat for getting back.

Are there any PredictionIO models/templates which really use Spark in "pio
deploy" ? (not just loading the Spark Context for loading the 'pio deploy'
driver and then dropping the Spark Context), but a running Spark Context
throughout the Prediction Server life cycle? Or how does PredictionIO
handle this case ? Does it create a new Spark Context every time a
prediction has to be done ?

Also, in the production deployments (where Spark is not really used), how do
you scale the Prediction Server? Do you just deploy the same model on multiple
machines and have a LB/HA Proxy to handle requests?

Thanks,
Megha



On Tue, Aug 7, 2018 at 9:35 PM, Pat Ferrel  wrote:

> PIO is designed to use Spark in train and deploy. But the Universal
> Recommender removes the need for Spark to make predictions. This IMO is a
> key to use Spark well—remove it from serving results. PIO creates a Spark
> context to launch the `pio deploy' driver but Spark is never used and the
> context is dropped.
>
> The UR also does not need to be re-deployed after each train. It hot swaps
> the new model into use outside of Spark and so if you never shut down the
>  PredictionServer you never need to re-deploy.
>
> The confusion comes from reading Apache PIO docs which may not do things
> this way—don’t read them. Each template defines its own requirements. To
> use the UR stick with its documentation.
>
> That means Spark is used to “train” only and you never re-deploy. Deploy
> once—train periodically.
>
>
> From: Ulavapalle Meghamala 
> 
> Reply: user@predictionio.apache.org 
> 
> Date: August 7, 2018 at 4:13:39 AM
> To: user@predictionio.apache.org 
> 
> Subject:  PredictionIO spark deployment in Production
>
> Hi,
>
> Are there any templates in PredictionIO where "spark" is used even in "pio
> deploy" ? How are you handling such cases ? Will you create a spark context
> every time you run a prediction ?
>
> I have gone through then documentation here: http://actionml.com/docs/
> single_driver_machine. But, it only talks about "pio train". Please guide
> me to any documentation that is available on the "pio deploy" with spark ?
>
> Thanks,
> Megha
>
>
--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CAOtZQD-KRpqz-Po6%3D%2BL2WhUh7kKa64yGihP44iSNdqb9nFE0Dg%40mail.gmail.com
<https://groups.google.com/d/msgid/actionml-user/CAOtZQD-KRpqz-Po6%3D%2BL2WhUh7kKa64yGihP44iSNdqb9nFE0Dg%40mail.gmail.com?utm_medium=email_source=footer>
.
For more options, visit https://groups.google.com/d/optout.


Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
The answers to your question illustrate why IMHO it is bad to have Spark
required for predictions.

Any of the MLlib ALS recommenders use Spark to predict and so run Spark
during the time they are deployed. They can use one machine or use the
entire cluster. This is one case where using the cluster slows down
predictions since part of the model may be spread across nodes. Spark is
not designed to scale in this manner for real-time queries but I believe
those are your options out of the box for the ALS recommenders.

To be both fast and scalable you would load the model entirely into memory
on one machine for fast queries then spread queries across many identical
machines to scale load. I don’t think any templates do this—it requires a
load balancer at very least, not to mention custom deployment code that
interferes with using the same machines for training.

The UR loads the model into Elasticsearch for serving independently
scalable queries.

I always advise you keep Spark out of serving for the reasons mentioned
above.


From: Ulavapalle Meghamala 

Date: August 7, 2018 at 9:27:46 AM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

Thanks Pat for getting back.

Are there any PredictionIO models/templates which really use Spark in "pio
deploy" ? (not just loading the Spark Context for loading the 'pio deploy'
driver and then dropping the Spark Context), but a running Spark Context
through out the Prediction Server life cycle ? Or How does Prediction IO
handle this case ? Does it create a new Spark Context every time a
prediction has to be done ?

Also, in the production deployments(where Spark is not really used), how do
you scale Prediction Server ? Do you just deploy same model on multiple
machines and have a LB/HA Proxy to handle requests?

Thanks,
Megha



On Tue, Aug 7, 2018 at 9:35 PM, Pat Ferrel  wrote:

> PIO is designed to use Spark in train and deploy. But the Universal
> Recommender removes the need for Spark to make predictions. This IMO is a
> key to use Spark well—remove it from serving results. PIO creates a Spark
> context to launch the `pio deploy' driver but Spark is never used and the
> context is dropped.
>
> The UR also does not need to be re-deployed after each train. It hot swaps
> the new model into use outside of Spark and so if you never shut down the
>  PredictionServer you never need to re-deploy.
>
> The confusion comes from reading Apache PIO docs which may not do things
> this way—don’t read them. Each template defines its own requirements. To
> use the UR stick with its documentation.
>
> That means Spark is used to “train” only and you never re-deploy. Deploy
> once—train periodically.
>
>
> From: Ulavapalle Meghamala 
> 
> Reply: user@predictionio.apache.org 
> 
> Date: August 7, 2018 at 4:13:39 AM
> To: user@predictionio.apache.org 
> 
> Subject:  PredictionIO spark deployment in Production
>
> Hi,
>
> Are there any templates in PredictionIO where "spark" is used even in "pio
> deploy" ? How are you handling such cases ? Will you create a spark context
> every time you run a prediction ?
>
> I have gone through then documentation here: http://actionml.com/docs/
> single_driver_machine. But, it only talks about "pio train". Please guide
> me to any documentation that is available on the "pio deploy" with spark ?
>
> Thanks,
> Megha
>
>


Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
PIO is designed to use Spark in train and deploy. But the Universal
Recommender removes the need for Spark to make predictions. This IMO is a
key to use Spark well—remove it from serving results. PIO creates a Spark
context to launch the `pio deploy' driver but Spark is never used and the
context is dropped.

The UR also does not need to be re-deployed after each train. It hot swaps
the new model into use outside of Spark and so if you never shut down the
 PredictionServer you never need to re-deploy.

The confusion comes from reading Apache PIO docs which may not do things
this way—don’t read them. Each template defines its own requirements. To
use the UR stick with its documentation.

That means Spark is used to “train” only and you never re-deploy. Deploy
once—train periodically.


From: Ulavapalle Meghamala 

Reply: user@predictionio.apache.org 

Date: August 7, 2018 at 4:13:39 AM
To: user@predictionio.apache.org 

Subject:  PredictionIO spark deployment in Production

Hi,

Are there any templates in PredictionIO where "spark" is used even in "pio
deploy" ? How are you handling such cases ? Will you create a spark context
every time you run a prediction ?

I have gone through then documentation here:
http://actionml.com/docs/single_driver_machine. But, it only talks about
"pio train". Please guide me to any documentation that is available on the
"pio deploy" with spark ?

Thanks,
Megha


Re: Straw poll: deprecating Scala 2.10 and Spark 1.x support

2018-08-02 Thread Pat Ferrel
+1


From: takako shimamoto  
Reply: user@predictionio.apache.org 

Date: August 2, 2018 at 2:55:49 AM
To: d...@predictionio.apache.org 
, user@predictionio.apache.org
 
Subject:  Straw poll: deprecating Scala 2.10 and Spark 1.x support

Hi all,

We're considering deprecating Scala 2.10 and Spark 1.x as of
the next release. Our intent is that using deprecated versions
can generate warnings, but that it should still work.

Nothing is concrete about actual removal of support at the moment, but
moving forward, use of Scala 2.11 and Spark 2.x will be recommended.
I think it's time to plan to deprecate 2.10 support, especially
with 2.12 coming soon.

This has an impact on some users, so if you see any issues with this,
please let us know as soon as possible.

Regards,
Takako


Re: 2 pio servers with 1 event server

2018-08-02 Thread Pat Ferrel
What template?


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: August 2, 2018 at 9:08:05 AM
To: user@predictionio.apache.org 

Subject:  2 pio servers with 1 event server

Greetings,

I am trying to run 2 pio servers on different ports where each server have
his own app. When I deploy the first server, I get the results I want for
prediction on that server. However, after deplying the second server on a
different port, the results from the first server got changed. Any idea on
how can I fix that?

Or is there some kind of procedures I should follow to be able to run 2
prediction servers from 2 different app but share the same template?

Regards,
Sami serbey


Re: [actionml/universal-recommender] Boosting categories only shows one category type (#55)

2018-07-06 Thread Pat Ferrel
Please read the docs. There is no need to $set users since they are
attached to usage events and can be detected automatically. In fact
"$set"ting them is ignored. There are no properties of users that are not
calculated based on named “indicators”, which can be profile type things.

For this application I’d ask myself what you want the user to do? Do you
want them to view a house listing or schedule a visit? Clearly you want
them to rent but there will only be one rent per user so it is not very
strong for indicating taste.

If you have something like 10 visits per user on average you may have
enough to use as the primary indicator since visits are, intuitively, closer
to “rent”. Page views, which may be 10x - 100x more frequent than visits,
are your last resort. But if page views are the best “primary” indicator you
have, still use visits and rents as secondary. Users have many motivations
for looking at listings, and these may be only to look at higher priced
units than they have any intent of renting, or to compare something they
would not rent to what they would. Therefore page views are somewhat removed
from the pure user intent behind every “rent”, but they may be the best
indicator you have.

Also consider using things like search terms as secondary indicators.

Then send primary and all secondary events with whatever ids correspond to
the event type. User profile data is harder to use and not as useful as
people think but is still encoded as an indicator, but with a different
“eventName”. Something like “location” could be used and might have an id
like a postal code—something that is large enough to include other users
but small enough to be exclusive also.

The above will give you several “usage events” with one primary.
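
As a rough sketch with the Python SDK (the event names, ids, access key, and
URL below are only examples I made up, not required names):

import predictionio

client = predictionio.EventClient(access_key="YOUR_ACCESS_KEY",
                                  url="http://localhost:7070")

# a secondary usage event: (user-id, "search", search-term)
client.create_event(event="search",
                    entity_type="user",
                    entity_id="amit70",
                    target_entity_type="item",
                    target_entity_id="2 bedroom garden apartment")

# profile data encoded as its own indicator, with a postal code as the id
client.create_event(event="location",
                    entity_type="user",
                    entity_id="amit70",
                    target_entity_type="item",
                    target_entity_id="94110")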

Business rules—which are used to restrict results—require you to $set
properties for every rental. So anything in the fields part of a query must
correspond to a possible property of items. Those look ok below.

Please use the Google group for questions. Github is for bug reports.


From: Amit Assaraf  
Reply: actionml/universal-recommender


Date: July 6, 2018 at 10:11:10 AM
To: actionml/universal-recommender


Cc: Subscribed 

Subject:  [actionml/universal-recommender] Boosting categories only shows
one category type (#55)

I have an app that uses Universal Recommender. The app is an app for
finding a house for rent.
I want to recommend users houses based on houses they viewed or scheduled a
tour on already.

I added all the users using the $set event.
I added all (96,676) the houses in the app like so:

predictionio_client.create_event(
event="$set",
entity_type="item",
entity_id=listing.meta.id,
properties={
  "property_type": ["villa"] # There are many
types of property_types such as "apartment"
}
)

And I add the events of the house view & schedule like so:

predictionio_client.create_event(
event="view",
entity_type="user",
entity_id=request.user.username,
target_entity_type="item",
target_entity_id=listing.meta.id
)

Now I want to get predictions for my users based on the property_types they
like.
So I send a prediction query boosting the property_types they like using
Business Rules like so:

{
'fields': [
{
 'bias': 1.05,
 'values': ['single_family_home', 'private_house',
'villa', 'cottage'],
 'name': 'property_type'
}
 ],
 'num': 15,
 'user': 'amit70'
}

Which I would then expect that I would get recommendations of different
types such as private_house or villa or cottage. But for some weird reason
while having over 95,000 houses of different property types I only get
recommendations of *ONE* single type (in this case villa) but if I remove
it from the list it just recommends 10 houses of ONE different type.
This is the response of the query:

{
"itemScores": [
{
"item": "56.39233,-4.11707|villa|0",
"score": 9.42542
},
{
"item": "52.3288,1.68312|villa|0",
"score": 9.42542
},
{
"item": "55.898878,-4.617019|villa|0",
"score": 8.531346
},
{
"item": "55.90713,-3.27626|villa|0",
"score": 8.531346
},
.

I cant understand why this is happening. The elasticsearch query this
translates to is this:
GET /recommender/_search

{
  "from": 0,
  "size": 15,
  "query": {
"bool": {
  "should": [
{
  "terms": {
"schedule": [
  "32.1439352176,34.833260278|private_house|0",
  "31.7848439,35.2047335|apartment_for_sale|0"
]
  }
},
{
  "terms": {
"view": [
  "32.0734919,34.7722675|garden_apartment|0",
  "32.1375986782,34.8415740159|apartment|0",
  

Re: Digging into UR algorithm

2018-07-02 Thread Pat Ferrel
The CCO algorithm tests for correlation with a statistic called the Log
Likelihood Ratio (LLR). This compares the relative frequencies of 4 different
counts: 2 having to do with the entire dataset and 2 having to do with the 2
events being compared for correlation. Popularity is normalized out of this
comparison but does play a small indirect part in having enough data to
make better guesses about correlation.
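
To make the 4 counts concrete, here is a rough Python sketch of the LLR score
on a 2x2 contingency table, in roughly the form used by Mahout’s
LogLikelihood helper (the example numbers are made up for illustration):

import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: users who did both events, k12: secondary only,
    # k21: primary only, k22: neither (the rest of the dataset)
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

print(llr(30, 70, 70, 9830))  # co-occur far more than chance -> large score
print(llr(1, 99, 99, 9801))   # co-occurrence at chance level -> near 0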

Also remember that the secondary event may have item-ids that are not part
of the primary event. For instance if you have good search data then one
(of several) secondary events might be (user-id, “searched-for”,
search-term). This as a secondary event has proven to be quite useful in at
least one dataset I’ve seen.


From: Pat Ferrel  
Reply: Pat Ferrel  
Date: July 2, 2018 at 12:18:16 PM
To: user@predictionio.apache.org 
, Sami Serbey 

Cc: actionml-user 

Subject:  Re: Digging into UR algorithm

The only requirement is that someone performed the primary event on A and
the secondary event is correlated to that primary event.

The UR can recommend to a user who has only performed the secondary event
on B as long as that is in the model. It makes no difference what subset of
events the user has performed; recommendations will work even if the user
has no primary events.

So think of the model as being separate from the user history of events.
Recs are made from user history—whatever it is, but the model must have
some correlated data for each event type you want to use from a user’s
history and sometimes on infrequently seen items there is no model data for
some event types.

Popularity has very little to do with recommendations except for the fact
that you are more likely to have good correlated events. In fact we do
things to normalize/down weight highly popular things because otherwise
recommendations are worse. You can tell this by doing cross-validation
tests for popular vs collaborative filtering using the CCO algorithm behind
the UR.

If you want popular items you can make a query with no user-id and you will
get the most popular. Also if there are not enough recommendations for a
user’s history data we fill in with popular.
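
For example, with the Python SDK a popularity-only query is just a query
with no user or item (the URL and num below are placeholders):

import predictionio

# a deployed UR PredictionServer; adjust the URL for your deployment
engine_client = predictionio.EngineClient(url="http://localhost:8000")

# no "user" and no "item" in the query => results come only from the
# popularity-style ranking (popular/trending/hot backfill)
popular = engine_client.send_query({"num": 10})
print(popular)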

Your questions don’t quite match how the algorithm works so hopefully this
straightens out some things.

BTW community support for the UR is here:
https://groups.google.com/forum/#!forum/actionml-user


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: July 2, 2018 at 9:32:01 AM
To: user@predictionio.apache.org 

Subject:  Digging into UR algorithm

Hi guys,

So I've been playing around with the UR algorithm and I would like to know
2 things if it is possible:

1- Does UR recommend items that are linked to primary event only? Like if
item A is purchased (primary event) 1 time and item B is liked (secondary
event) 50 times, does UR only recommend item A as the popular one even
though item B have x50 secondary event? Is there a way to play around this?

2- When I first read about UR I thought that it recommend items based on
the frequency of secondary events to primary events. ie: if 50 likes
(secondary event) of item A lead to the purchase of item B and 1 view
(secondary event) of item A lead to the purchase of item C, when someone
view and like item A he will get recommended item B and C with equal score
disregarding the 50 likes vs 1 view. Is that the correct behavior or am I
missing something? Does all secondary event have same weight of influence
for the recommender?

I hope that you can help me out understanding UR template.

Regards,
Sami Serbey



--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CAOtZQD8CU5fVvZ9C32Cj6YaC1F%2B7oxWF%2Br21ApKnuajOZOFuoA%40mail.gmail.com
<https://groups.google.com/d/msgid/actionml-user/CAOtZQD8CU5fVvZ9C32Cj6YaC1F%2B7oxWF%2Br21ApKnuajOZOFuoA%40mail.gmail.com?utm_medium=email_source=footer>
.
For more options, visit https://groups.google.com/d/optout.


Re: Digging into UR algorithm

2018-07-02 Thread Pat Ferrel
The only requirement is that someone performed the primary event on A and
the secondary event is correlated to that primary event.

The UR can recommend to a user who has only performed the secondary event
on B as long as that is in the model. It makes no difference what subset of
events the user has performed; recommendations will work even if the user
has no primary events.

So think of the model as being separate from the user history of events.
Recs are made from user history—whatever it is, but the model must have
some correlated data for each event type you want to use from a user’s
history and sometimes on infrequently seen items there is no model data for
some event types.

Popularity has very little to do with recommendations except for the fact
that you are more likely to have good correlated events. In fact we do
things to normalize/down weight highly popular things because otherwise
recommendations are worse. You can tell this by doing cross-validation
tests for popular vs collaborative filtering using the CCO algorithm behind
the UR.

If you want popular items you can make a query with no user-id and you will
get the most popular. Also if there are not enough recommendations for a
user’s history data we fill in with popular.

Your questions don’t quite match how the algorithm works so hopefully this
straightens out some things.

BTW community support for the UR is here:
https://groups.google.com/forum/#!forum/actionml-user


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: July 2, 2018 at 9:32:01 AM
To: user@predictionio.apache.org 

Subject:  Digging into UR algorithm

Hi guys,

So I've been playing around with the UR algorithm and I would like to know
2 things if it is possible:

1- Does UR recommend items that are linked to primary event only? Like if
item A is purchased (primary event) 1 time and item B is liked (secondary
event) 50 times, does UR only recommend item A as the popular one even
though item B have x50 secondary event? Is there a way to play around this?

2- When I first read about UR I thought that it recommend items based on
the frequency of secondary events to primary events. ie: if 50 likes
(secondary event) of item A lead to the purchase of item B and 1 view
(secondary event) of item A lead to the purchase of item C, when someone
view and like item A he will get recommended item B and C with equal score
disregarding the 50 likes vs 1 view. Is that the correct behavior or am I
missing something? Does all secondary event have same weight of influence
for the recommender?

I hope that you can help me out understanding UR template.

Regards,
Sami Serbey


[jira] [Updated] (MAHOUT-2048) There are duplicate content pages which need redirects instead

2018-06-27 Thread Pat Ferrel (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-2048:
---
Sprint: 0.14.0 Release

> There are duplicate content pages which need redirects instead
> --
>
> Key: MAHOUT-2048
> URL: https://issues.apache.org/jira/browse/MAHOUT-2048
> Project: Mahout
>  Issue Type: Planned Work
>  Components: website
>Affects Versions: 0.13.0, 0.14.0
>    Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 0.13.0, 0.14.0
>
>
> I have duplicated content in 3 places in the `website/` directory. We need to 
> have one place for the real content and replace the dups with redirects to the 
> actual content. This looks like it may be true for several other pages and 
> honestly I'm not sure if they are all needed but there are many links out in 
> the wild that point to the old path for the CCO recommender pages so we 
> should do this for the ones below at least. Better yet we may want to clean 
> out any other dups unless someone knows why not.
> TLDR;
> Actual content:
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/docs/latest/algorithms/recommenders/cco.md
>  
> Dups to be replaced with redirects to the above content. I vaguely remember 
> all these different site structures so there may be links to them in the wild.
> mahout/website/recommender-overview.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/algorithms/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md
> mahout/website/users/recommender/quickstart.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/recommender/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MAHOUT-2048) There are duplicate content pages which need redirects instead

2018-06-27 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2048:
--

 Summary: There are duplicate content pages which need redirects 
instead
 Key: MAHOUT-2048
 URL: https://issues.apache.org/jira/browse/MAHOUT-2048
 Project: Mahout
  Issue Type: Planned Work
  Components: website
Affects Versions: 0.13.0
Reporter: Pat Ferrel
Assignee: Andrew Musselman
 Fix For: 0.13.0


I have duplicated content in 3 places in the `website/` directory. We need to 
have one place for the real content and replace the dups with redirects to the 
actual content. This looks like it may be true for several other pages and 
honestly I'm not sure if they are all needed but there are many links out in 
the wild that point to the old path for the CCO recommender pages so we should 
do this for the ones below at least. Better yet we may want to clean out any 
other dups unless someone knows why not.



TLDR;

Actual content:

mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/docs/latest/algorithms/recommenders/cco.md

 

Dups to be replaced with redirects to the above content. I vaguely remember all 
these different site structures so there may be links to them in the wild.


mahout/website/recommender-overview.md => 
mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/users/algorithms/intro-cooccurrence-spark.md => 
mahout/website/docs/latest/algorithms/recommenders/cco.md

mahout/website/users/recommender/quickstart.md => 
mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/users/recommender/intro-cooccurrence-spark.md => 
mahout/website/docs/latest/algorithms/recommenders/cco.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: a question about a high availability of Elasticsearch cluster

2018-06-22 Thread Pat Ferrel
This should work with any node down. Elasticsearch should elect a new
master.

What version of PIO are you using? PIO and the UR changed the client from
the transport client to the REST client in 0.12.0, which is why you are
using port 9200.

Do all PIO functions work correctly like:

   - pio app list
   - pio app new

with all the configs and missing nodes you describe? What I’m trying to
find out is if the problem is only with queries, which do use ES in a
different way.

What is the es.nodes setting in the engine.json’s sparkConf?


From: jih...@braincolla.com  
Date: June 22, 2018 at 12:53:48 AM
To: actionml-user 

Subject:  a question about a high availability of Elasticsearch cluster

Hello Pat,

May I have a question about Elasticsearch cluster in PIO and UR.

I've implemented some Elasticsearch cluster consisted of 3 nodes on below
options.

**
cluster.name: my-search-cluster
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: [“node 1”, “node 2", “node 3”]

And I wrote the PIO options below.

**
...
# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch

# The next line should match the ES cluster.name in ES config
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=my-search-cluster

# For clustered Elasticsearch (use one host/port if not clustered)
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=node1,node2,node3
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200
...

My questions are below.

1. I killed the Elasticsearch process in node 2 or node 3. PIO is well
working. But when the Elasticsearch process in node 1 is killed, PIO is not
working. Is it right?

2. I've changed PIO options below. I killed the Elasticsearch process in
node 1 or node 3. PIO is well working. But when the Elasticsearch in node 2
is killed, PIO is not working. Is it right?
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=node2,node1,node3

3. In my opinion, if first node configurd at
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS is killed, PIO is not working. Is
it right? If yes, please let me know why it happened.

Thank you.
--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/254aed9f-c975-4726-8b90-2ea80d6a2a34%40googlegroups.com

.
For more options, visit https://groups.google.com/d/optout.


Re: UR trending ranking as separate process

2018-06-20 Thread Pat Ferrel
Yes, we support “popular”, “trending”, and “hot” as methods for ranking items. 
The UR queries are backfilled with these items if there are not enough results. 
So if the user has little history and so only gets 5 out of 10 results based
on this history, we will automatically return the other 5 from the “popular” 
results. This is the default, if there is no specific config for this.

If you query with no user or item, we will return only from “popular” or 
whatever brand of ranking you have setup.

To change which type of ranking you want you can specify the period to use in 
calculating the ranking and which method from “popular”, “trending”, and “hot”. 
These roughly correspond to # of conversions, speed of conversion, and 
acceleration in conversions, if that helps.

Docs here: http://actionml.com/docs/ur_config Search for “rankings" 
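
For illustration only, the relevant engine.json fragment looks roughly like the
dict below (field names and values here are from memory and should be treated as
assumptions; the config docs linked above are authoritative):

# rough sketch of the UR "rankings" setting, written as a Python dict;
# the real thing lives in engine.json
rankings_fragment = {
    "rankings": [
        {
            "name": "trendRank",    # field the ranking score is stored under
            "type": "trending",     # "popular", "trending", or "hot"
            "eventNames": ["buy"],  # conversion events to count
            "duration": "2 days"    # period the frequency buckets cover
        }
    ]
}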


From: Sami Serbey 
Reply: user@predictionio.apache.org 
Date: June 20, 2018 at 10:25:53 AM
To: user@predictionio.apache.org , Pat Ferrel 

Cc: user@predictionio.apache.org 
Subject:  Re: UR trending ranking as separate process  

Hi George,

I didn't get your question but I think I am missing something. So you're using 
the Universal Recommender and you're getting a sorted output based on the 
trending items? Is that really a thing in this template? May I please know how 
can you configure the template to get such output? I really hope you can answer 
that. I am also working with the UR template.

Regards,
Sami Serbey

Get Outlook for iOS
From: George Yarish 
Sent: Wednesday, June 20, 2018 7:45:12 PM
To: Pat Ferrel
Cc: user@predictionio.apache.org
Subject: Re: UR trending ranking as separate process
 
Matthew, Pat

Thanks for the answers and concerns. Yes, we want to calculate every 30 minutes 
trending for the last X hours, where X might be even a few days. So the realtime 
analogy is correct. 

On Wed, Jun 20, 2018 at 6:50 PM, Pat Ferrel  wrote:
No the trending algorithm is meant to look at something like trends over 2 
days. This is because it looks at 2 buckets of conversion frequencies and if 
you cut them smaller than a day you will have so much bias due to daily 
variations that the trends will be invalid. In other words the ups and downs 
over a day period need to be made irrelevant and taking day long buckets is the 
simplest way to do this. Likewise for “hot” which needs 3 buckets and so takes 
3 days worth of data. 

Maybe what you need is to just count conversions for 30 minutes as a realtime 
thing. For every item, keep conversions for the last 30 minutes, sort them 
periodically by count. This is a Kappa style algorithm doing online learning, 
not really supported by PredictionIO. You will have to experiment with the 
length of time since a too small period will be very noisy, popping back and 
forth between items semi-randomly.


From: George Yarish 
Reply: user@predictionio.apache.org 
Date: June 20, 2018 at 8:34:10 AM
To: user@predictionio.apache.org 
Subject:  UR trending ranking as separate process 

Hi!

Not sure this is correct place to ask, since my question correspond to UR 
specifically, not to pio itself I guess. 

Anyway, we are using UR template for predictionio and we are about to use 
trending ranking for sorting UR output. If I understand it correctly ranking is 
created during training and stored in ES. Our training takes ~ 3 hours and we 
launch it daily by scheduler but for trending rankings we want to get actual 
information every 30 minutes.

That means we want to separate training (scores calculation) and ranking 
calculation and launch them by different schedule.

Is there any easy way to achieve it? Does UR supports something like this?

Thanks,
George



-- 






George Yarish, Java Developer


Grid Dynamics


197101, Rentgena Str., 5A, St.Petersburg, Russia

Cell: +7 950 030-1941


Read Grid Dynamics' Tech Blog



Re: UR trending ranking as separate process

2018-06-20 Thread Pat Ferrel
No the trending algorithm is meant to look at something like trends over 2
days. This is because it looks at 2 buckets of conversion frequencies and
if you cut them smaller than a day you will have so much bias due to daily
variations that the trends will be invalid. In other words the ups and
downs over a day period need to be made irrelevant and taking day long
buckets is the simplest way to do this. Likewise for “hot” which needs 3
buckets and so takes 3 days worth of data.

Maybe what you need is to just count conversions for 30 minutes as a
realtime thing. For every item, keep conversions for the last 30 minutes,
sort them periodically by count. This is a Kappa style algorithm doing
online learning, not really supported by PredictionIO. You will have to
experiment with the length of time since a too small period will be very
noisy, popping back and forth between items semi-randomly.
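
If you do go the realtime-count route outside of PIO, a minimal sketch looks
something like this (the window length and all names are arbitrary choices):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60          # "last 30 minutes"
events = deque()                  # (timestamp, item_id) in arrival order
counts = defaultdict(int)         # item_id -> conversions inside the window

def record_conversion(item_id, now=None):
    now = time.time() if now is None else now
    events.append((now, item_id))
    counts[item_id] += 1
    _expire(now)

def _expire(now):
    # drop events that have fallen out of the window
    while events and now - events[0][0] > WINDOW_SECONDS:
        _, old_item = events.popleft()
        counts[old_item] -= 1
        if counts[old_item] <= 0:
            del counts[old_item]

def top_items(n=10):
    _expire(time.time())
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]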


From: George Yarish  
Reply: user@predictionio.apache.org 

Date: June 20, 2018 at 8:34:10 AM
To: user@predictionio.apache.org 

Subject:  UR trending ranking as separate process

Hi!

Not sure this is correct place to ask, since my question correspond to UR
specifically, not to pio itself I guess.

Anyway, we are using UR template for predictionio and we are about to use
trending ranking for sorting UR output. If I understand it correctly
ranking is created during training and stored in ES. Our training takes ~ 3
hours and we launch it daily by scheduler but for trending rankings we want
to get actual information every 30 minutes.

That means we want to separate training (scores calculation) and ranking
calculation and launch them by different schedule.

Is there any easy way to achieve it? Does UR supports something like this?

Thanks,
George


Re: java.util.NoSuchElementException: head of empty list when running train

2018-06-19 Thread Pat Ferrel
Yes, those instructions tell you to run HDFS in pseudo-cluster mode. What
do you see in the HDFS GUI on localhost:50070 ?

Those setup instructions create a pseudo-clustered Spark, and HDFS/HBase.
This runs on a single machine but as the page says, are configured so you
can easily expand to a cluster by replacing config to point to remote HDFS
or Spark clusters.

One fix, if you don’t want to run those services in pseudo-cluster mode is:

1) remove any mention of PGSQL or jdbc, we are not using it. These are not
found on the page you linked to and are not used.
2) on a single machine you can put the dummy/empty model file in LOCALFS so
change the lines
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://localhost:9000/models
to
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
PIO_STORAGE_SOURCES_HDFS_PATH=/path/to/models
substituting with a directory where you want to save models

Running them in a pseudo-cluster mode gives you GUIs to see job progress
and browse HDFS for files, among other things. We recommend it for helping
to debug problems when you get to large amounts of data and begin running
out of resources.


From: Anuj Kumar  
Date: June 19, 2018 at 10:35:02 AM
To: p...@occamsmachete.com  
Cc: user@predictionio.apache.org 
, actionml-u...@googlegroups.com
 
Subject:  Re: java.util.NoSuchElementException: head of empty list when
running train

Hi Pat,
  Read it on the below link

http://actionml.com/docs/single_machine

here is the pio-env.sh

SPARK_HOME=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar

MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

HBASE_CONF_DIR=/usr/local/hbase/conf

PIO_FS_BASEDIR=$HOME/.pio_store

PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines

PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta

PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event

PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model

PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc

PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio

PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio

PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/els

PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=pio

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs

PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://localhost:9000/models

PIO_STORAGE_SOURCES_HBASE_TYPE=hbase

PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase

Thanks,
Anuj Kumar



On Tue, Jun 19, 2018 at 9:16 PM Pat Ferrel  wrote:

> Can you show me where on the AML site it says to store models in HDFS, it
> should not say that? I think that may be from the PIO site so you should
> ignore it.
>
> Can you share your pio-env? You need to go through the whole workflow from
> pio build, pio train, to pio deploy using a template from the same
> directory and with the same engine.json and pio-env and I suspect something
> is wrong in pio-env.
>
>
> From: Anuj Kumar 
> 
> Date: June 19, 2018 at 1:28:11 AM
> To: p...@occamsmachete.com  
> Cc: user@predictionio.apache.org 
> , actionml-u...@googlegroups.com
>  
> Subject:  Re: java.util.NoSuchElementException: head of empty list when
> running train
>
> Tried with basic engine.json mentioned at UL site examples. Seems to work
> but got stuck at "pio deploy" throwing following error
>
> [ERROR] [OneForOneStrategy] Failed to invert: [B@35c7052
>
>
> before that "pio train" was successful but gave following error. I suspect
> because of this reason "pio deploy" is not working. Please help
>
> [ERROR] [HDFSModels] File /models/pio_modelAWQXIr4APcDlNQi8DwVj could only
> be replicated to 0 nodes instead of minReplication (=1).  There are 0
> datanode(s) running and no node(s) are excluded in this operation.
>
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1726)
>
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
>
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2565)
>
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)
>
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>
> at
> org.

Re: java.util.NoSuchElementException: head of empty list when running train

2018-06-19 Thread Pat Ferrel
Can you show me where on the AML site it says to store models in HDFS, it
should not say that? I think that may be from the PIO site so you should
ignore it.

Can you share your pio-env? You need to go through the whole workflow from
pio build, pio train, to pio deploy using a template from the same
directory and with the same engine.json and pio-env and I suspect something
is wrong in pio-env.


From: Anuj Kumar  
Date: June 19, 2018 at 1:28:11 AM
To: p...@occamsmachete.com  
Cc: user@predictionio.apache.org 
, actionml-u...@googlegroups.com
 
Subject:  Re: java.util.NoSuchElementException: head of empty list when
running train

Tried with basic engine.json mentioned at UL site examples. Seems to work
but got stuck at "pio deploy" throwing following error

[ERROR] [OneForOneStrategy] Failed to invert: [B@35c7052


before that "pio train" was successful but gave following error. I suspect
because of this reason "pio deploy" is not working. Please help

[ERROR] [HDFSModels] File /models/pio_modelAWQXIr4APcDlNQi8DwVj could only
be replicated to 0 nodes instead of minReplication (=1).  There are 0
datanode(s) running and no node(s) are excluded in this operation.

at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1726)

at
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)

at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2565)

at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)

at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)

at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)

at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)


On Tue, Jun 19, 2018 at 10:45 AM Anuj Kumar 
wrote:

> Sure, here it is.
>
> {
>
>   "comment":" This config file uses default settings for all but the
> required values see README.md for docs",
>
>   "id": "default",
>
>   "description": "Default settings",
>
>   "engineFactory": "com.actionml.RecommendationEngine",
>
>   "datasource": {
>
> "params" : {
>
>   "name": "sample-handmad",
>
>   "appName": "np",
>
>   "eventNames": ["read", "search", "view", "category-pref"],
>
>   "minEventsPerUser": 1,
>
>   "eventWindow": {
>
> "duration": "300 days",
>
> "removeDuplicates": true,
>
> "compressProperties": true
>
>   }
>
> }
>
>   },
>
>   "sparkConf": {
>
> "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
>
> "spark.kryo.registrator":
> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
>
> "spark.kryo.referenceTracking": "false",
>
> "spark.kryoserializer.buffer": "300m",
>
> "spark.executor.memory": "4g",
>
> "spark.executor.cores": "2",
>
> "spark.task.cpus": "2",
>
> "spark.default.parallelism": "16",
>
> "es.index.auto.create": "true"
>
>   },
>
>   "algorithms": [
>
> {
>
>   "comment": "simplest setup where all values are default, popularity
> based backfill, must add eventsNames",
>
>   "name": "ur",
>
>   "params": {
>
> "appName": "np",
>
>     "indexName": "np",
>
> "typeName": "items",
>
> "blacklistEvents": [],
>
> "comment": "must have data for the first event or the model will
> not build, other events are optional",
>
> "indicators": [
>
>

Re: Few Queries Regarding the Recommendation Template

2018-06-13 Thread Pat Ferrel
Wow that page should be reworded or removed. They are trying to talk about
ensemble models, which are a valid thing but they badly misapply it there.
The application to multiple data types is just wrong and I know because I
tried exactly what they are suggesting but with cross-validation tests to
measure how much worse things got.

For instance if you use buy and dislike what kind of result are you going
to get if you have 2 models? One set of results will recommend “buy” the
other will tell you what a user is likely to “dislike”. How do you combine
them?

Ensembles are meant to use multiple *algorithms* and do something like
voting on recommendations. But you have to pay close attention to what the
algorithm uses as input and what it recommends. All members of the ensemble
must recommend the same action to the user.

Whoever contributed this statement: The default algorithm described in DASE
<https://predictionio.apache.org/templates/similarproduct/dase/#algorithm> uses
user-to-item view events as training data. However, your application may
have more than one type of events which you want to take into account, such
as buy, rate and like events. One way to incorporate other types of events
to improve the system is to add another algorithm to process these events,
build a separated model and then combine the outputs of multiple algorithms
during Serving.

Is patently wrong. Ensembles must recommend the same action to users and
unless each algorithm in the ensemble is recommending the same thing (all
be it with slightly different internal logic) then you will get gibberish
out. The winner of the Netflix prize did an ensemble with 107 (IIRC)
different algorithms all using exactly the same input data. There is no
principle that says if you feed conflicting data into several ensemble
algorithms that you will get diamonds out.

Furthermore using view events is bad to begin with because the recommender
will recommend what it thinks you want to view. We did this once with a
large dataset from a big E-Com company where we did cross-validation tests
using “buy” alone, “view” alone,  and ensembles of “buy” and “view”. We got
far better results using buy alone than using buy with ~100x as many
“views". The intent of the user and how they find things to view is so
different than when they finally come to buy something that adding view
data got significantly worse results. This is because people have different
reasons to view—maybe a flashy image, maybe a promotion, maybe some
placement bias, etc. This type of browsing “noise” pollutes the data which
can no longer be used to recommend “buy”s. We did several experiments
including comparing several algorithms types with “buy” and “view” events.
“view” always lost to “buy” no matter the algo we used (they were all
unimodal). There may be some exception to this result out there but it will
be accidental, not because it is built into the algorithm. When I say this
worsened results I’m not talking about some tiny fraction of a %, I’m
talking about a decrease of 15-20%

You could argue that “buy”, “like”, and rate will produce similar results
but from experience I can truly say that view and dislike will not.

Since the method described on the site is so sensitive to the user intent
recorded in events I would never use something like that without doing
cross-validation tests and then you are talking about a lot of work. There
is no theoretical or algorithmic correlation detection built into the
ensemble method so you may or may not get good results and I can say
unequivocally that the exact thing they describe will give worse results
(or at least it did in our experiments). You cannot ignore the intent
behind the data you use as input unless this type of correlation detection
is built into the algorithm and with the ensemble method described this
issue is completely ignored.

The UR uses the Correlated Cross-Occurrence algorithm for this exact reason
and was invented to solve the problem we found using “buy” and “view” data
together.  Let’s take a ridiculous extreme and use “dislikes" to recommend
“likes”? Does that even make sense? Check out an experiment with CCO where
we did this exact thing:
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

OK, rant over :-) Thanks for bringing up one of the key issues being
addressed by modern recommenders—multimodality. It is being addressed in
scientific ways, unfortunately the page on PIO’s site gets it wrong.




From: KRISH MEHTA  
Reply: KRISH MEHTA  
Date: June 13, 2018 at 2:19:17 PM
To: Pat Ferrel  
Subject:  Re: Few Queries Regarding the Recommendation Template

I Understand but if I just want the likes, dislikes and views then I can
combine the algorithms right? Given in the link:
https://predictionio.apache.org/templates/similarproduct/multi-events-multi-algos/
I
hope this works.

On Jun 13, 2018, at 1:19 PM, Pat Ferrel  wrote:

I would strongly recommend against using rati

Re: True Negative - ROC Curve

2018-06-12 Thread Pat Ferrel
We do not use these for recommenders. The measured precision rate can look
low even when the lift in your KPI, like sales, is relatively high. This is
not like classification.

We use MAP@k with increasing values of k. This should yield a diminishing
mean average precision chart with increasing k. This tells you 2 things: 1)
you are guessing in the right order; MAP@1 greater than MAP@2 means your
first guess is better than your second. The rate of decrease tells you
how fast the precision drops off with higher k. And 2) the baseline MAP@k
for future comparisons when tuning your engine or in champion/challenger
comparisons before putting into A/B tests.
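
For what it’s worth, a small Python sketch of MAP@k (one of several common
variants; this one divides by min(#held-out, k), and the data is a made-up
toy example):

def average_precision_at_k(recs, held_out, k):
    # recs: ranked list returned for one user
    # held_out: items that user actually converted on in the test split
    score, hits = 0.0, 0
    for i, item in enumerate(recs[:k]):
        if item in held_out:
            hits += 1
            score += hits / float(i + 1)   # precision at this cut-off
    return score / min(len(held_out), k) if held_out else 0.0

def map_at_k(all_recs, all_held_out, k):
    pairs = list(zip(all_recs, all_held_out))
    return sum(average_precision_at_k(r, h, k) for r, h in pairs) / len(pairs)

# toy example: one test user
recs = [["a", "b", "c", "d"]]
held = [{"a", "d"}]
for k in (1, 2, 4):
    print("MAP@%d = %.3f" % (k, map_at_k(recs, held, k)))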

Also note that RMSE has been pretty much discarded as an offline metric for
recommenders, it only really gives you a metric for ratings, and who cares
about that. No one wants to optimize rating guesses anymore; conversions are
all that matter and precision is the way to measure potential conversion,
since it actually measures how precise our guess is about what the user
actually converted on in the test set. Ranking is next most important since
you have a limited number of recommendations to show, you want the best
ranked first. MAP@k over a range of k does this but clients often try to
read sales lift in this and there is no absolute relationship. You can
guess at one once you have A/B test results, and you should also compare
non-recommendation results like random recs, or popular recs. If MAP is
lower or close to these, you may not have a good recommender or data.

AUC is not for every task. In this case the only positive is a conversion
in the test data and the only negative is the absence of conversion, so the
ROC curve will be nearly useless.


From: Nasos Papageorgiou 

Reply: user@predictionio.apache.org 

Date: June 12, 2018 at 7:17:04 AM
To: user@predictionio.apache.org 

Subject:  True Negative - ROC Curve

Hi all,

I want to use ROC curve (AUC - Area Under the Curve) for evaluation of
recommended system in case of retailer. Could you please give an example of
True Negative value?

i.e. True Positive is the number of items on the Recommended List that are
appeared on the test data set, where the test data set may be the 20%  of
the full data.

Thank you.






Re: Regarding Real-Time Prediction

2018-06-11 Thread Pat Ferrel
Actually if you are using the Universal Recommender you only need to deploy 
once as long as the engine.json does not change. The hot swap happens as 
@Digambar says and there is literally no downtime. If you are using any of the 
other recommenders you do have to re-deploy after every train but the deploy 
happens very quickly, a ms or 2 as I recall.


From: Digambar Bhat 
Reply: user@predictionio.apache.org 
Date: June 11, 2018 at 9:38:15 AM
To: user@predictionio.apache.org 
Subject:  Re: Regarding Real-Time Prediction  

You don't need to deploy same engine again and again. You just deploy once and 
train whenever you want. Deployed instance will automatically point to newly 
trained model as hot swap happens. 

Regards,
Digambar

On Mon 11 Jun, 2018, 10:02 PM KRISH MEHTA,  wrote:
Hi,
I have just started using PredictionIO and according to the documentation I 
have to always run the Train and Deploy Command to get the prediction. I am 
working on predicting videos for recommendation and I want to know if there is 
any other way possible so that I can predict the results on the Fly with no 
Downtime.

Please help me with the same.

Yours Sincerely,
Krish

Re: UR template minimum event number to recommend

2018-06-04 Thread Pat Ferrel
No but we have 2 ways to handle this situation automatically and you can
tell if recommendations are not from personal user history.


   1. when there is not enough user history to recommend, we fill in the
   lower ranking recommendations with popular, trending, or hot items. Not
   completely irrelevant but certainly not as good as if we had more data for
   them.
   2. You can also mix item and user-based recs. So if you have an item,
   perhaps from the page or screen the user is looking at, you can send both
   user and item in the query. If you want user-based, boost it higher with
the userBias. Then if the query cannot send back user-based recs it will fill in
   with item-based. This only works in certain situations where you have some
   example item.

As always if you do a user-based query and all scores are 0, you know that
no real recommendations are included and can take some other action.


From: Krajcs Ádám  
Reply: user@predictionio.apache.org 

Date: June 4, 2018 at 5:14:33 AM
To: user@predictionio.apache.org 

Subject:  UR template minimum event number to recommend

Hi,



Is it possible to somehow configure the universal recommender to recommend
items only to users with a minimum number of events? For example a user with 2
view events usually gets irrelevant recommendations, but 5 events would be
enough.



Thanks!



Regads,

Adam Krajcs


Re: PIO 0.12.1 with HDP Spark on YARN

2018-05-29 Thread Pat Ferrel
Yarn has to be started explicitly. Usually it is part of Hadoop and is
started with Hadoop. Spark only contains the client for Yarn (afaik).



From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 6:45:43 PM
To: user@predictionio.apache.org 

Subject:  Re: PIO 0.12.1 with HDP Spark on YARN

That's the command that I'm using but it gives me the exception that I
listed in the previous email.  I've installed a Spark standalone cluster
and am using that for training for now but would like to use Spark on YARN
eventually.

Are you using HDP? If so, what version of HDP are you using?  I'm using
*HDP-2.6.2.14.*



On Tue, May 29, 2018 at 8:55 PM, suyash kharade 
wrote:

> I use 'pio train -- --master yarn'
> It works for me to train universal recommender
>
> On Tue, May 29, 2018 at 8:31 PM, Miller, Clifford <
> clifford.mil...@phoenix-opsgroup.com> wrote:
>
>> To add more details to this.  When I attempt to execute my training job
>> using the command 'pio train -- --master yarn' I get the exception that
>> I've included below.  Can anyone tell me how to correctly submit the
>> training job or what setting I need to change to make this work.  I've made
>> not custom code changes and am simply using PIO 0.12.1 with the
>> SimilarProduct Recommender.
>>
>>
>>
>> [ERROR] [SparkContext] Error initializing SparkContext.
>> [INFO] [ServerConnector] Stopped Spark@1f992a3a{HTTP/1.1}{0.0.0.0:4040}
>> [WARN] [YarnSchedulerBackend$YarnSchedulerEndpoint] Attempted to request
>> executors before the AM has registered!
>> [WARN] [MetricsSystem] Stopping a MetricsSystem that is not running
>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$se
>> tEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:154)
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$se
>> tEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:152)
>> at scala.collection.IndexedSeqOptimized$class.foreach(
>> IndexedSeqOptimized.scala:33)
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.
>> scala:186)
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.setEnvFrom
>> InputString(YarnSparkHadoopUtil.scala:152)
>> at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$
>> 6.apply(Client.scala:819)
>> at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$
>> 6.apply(Client.scala:817)
>> at scala.Option.foreach(Option.scala:257)
>> at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.sc
>> ala:817)
>> at org.apache.spark.deploy.yarn.Client.createContainerLaunchCon
>> text(Client.scala:911)
>> at org.apache.spark.deploy.yarn.Client.submitApplication(Client
>> .scala:172)
>> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBacken
>> d.start(YarnClientSchedulerBackend.scala:56)
>> at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSched
>> ulerImpl.scala:156)
>> at org.apache.spark.SparkContext.(SparkContext.scala:509)
>> at org.apache.predictionio.workflow.WorkflowContext$.apply(
>> WorkflowContext.scala:45)
>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(
>> CoreWorkflow.scala:59)
>> at org.apache.predictionio.workflow.CreateWorkflow$.main(Create
>> Workflow.scala:251)
>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateW
>> orkflow.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>> $SparkSubmit$$runMain(SparkSubmit.scala:751)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>> .scala:187)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.
>> scala:212)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:
>> 126)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>>
>> On Tue, May 29, 2018 at 12:01 AM, Miller, Clifford <
>> clifford.mil...@phoenix-opsgroup.com> wrote:
>>
>>> So updating the version in the RELEASE file to 2.1.1 fixed the version
>>> detection problem but I'm still not able to submit Spark jobs unless they
>>> are strictly local.  How are you submitting to the HDP Spark?
>>>
>>> Thanks,
>>>
>>> --Cliff.
>>>
>>>
>>>
>>> On Mon, May 28, 2018 at 1:12 AM, suyash kharade <
>>> suyash.khar...@gmail.com> wrote:
>>>
 Hi Miller,
 I faced same issue.
 It is giving error as release file has '-' in version
 Insert simple version in release file something like 2.6.

 On Mon, May 28, 2018 at 4:32 AM, Miller, 

Re: Spark cluster error

2018-05-29 Thread Pat Ferrel
Yes, the spark-submit --jars is where we started to find the missing class.
The class isn’t found on the remote executor so we looked in the jars
actually downloaded into the executor’s work dir. The PIO assembly jars are
there and do have the classes. This would be in the classpath of the
executor, right? Not sure what you are asking.

Are you asking about the SPARK_CLASSPATH in spark-env.sh? The default
should include the work subdir for the job, I believe, and it can only be
added to, so we couldn’t have messed that up if it points first to the
work/job-number dir, right?

I guess the root of my question is how can the jars be downloaded to the
executor’s work dir and still the classes we know are in the jar are not
found?
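
For anyone following along, this is roughly the kind of check we ran on a
worker node. It is a sketch only; the app id, jar names, and paths below are
placeholders, not the exact ones from our cluster:

```
# On a Spark standalone worker, go to the work dir Spark created for this app
cd $SPARK_HOME/work/app-20180529000000-0001/0

# List the jars the executor actually downloaded
ls -l *.jar

# Confirm the class is really inside the downloaded PIO hbase assembly
unzip -l pio-assembly-*.jar | grep ProtobufUtil

# Compare the hash with the assembly on the driver machine
md5sum pio-assembly-*.jar
```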


From: Donald Szeto  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 1:27:03 PM
To: user@predictionio.apache.org 

Subject:  Re: Spark cluster error

Sorry, what I meant was the actual spark-submit command that PIO was using.
It should be in the log.

What Spark version was that? I recall classpath issues with certain
versions of Spark.

On Thu, May 24, 2018 at 4:52 PM, Pat Ferrel  wrote:

> Thanks Donald,
>
> We have:
>
>- built pio with hbase 1.4.3, which is what we have deployed
>- verified that the `ProtobufUtil` class is in the pio hbase assembly
>- verified the assembly is passed in --jars to spark-submit
>- verified that the executors receive and store the assemblies in the
>FS work dir on the worker machines
>- verified that hashes match the original assembly so the class is
>being received by every executor
>
> However the executor is unable to find the class.
>
> This seems just short of impossible but clearly possible. How can the
> executor deserialize the code but not find it later?
>
> Not sure what you mean the classpath going in to the cluster? The classDef
> not found does seem to be in the pio 0.12.1 hbase assembly, isn’t this
> where it should get it?
>
> Thanks again
> p
>
>
> From: Donald Szeto  
> Reply: user@predictionio.apache.org 
> 
> Date: May 24, 2018 at 2:10:24 PM
> To: user@predictionio.apache.org 
> 
> Subject:  Re: Spark cluster error
>
> 0.12.1 packages HBase 0.98.5-hadoop2 in the storage driver assembly.
> Looking at Git history it has not changed in a while.
>
> Do you have the exact classpath that has gone into your Spark cluster?
>
> On Wed, May 23, 2018 at 1:30 PM, Pat Ferrel  wrote:
>
>> A source build did not fix the problem, has anyone run PIO 0.12.1 on a
>> Spark cluster? The issue seems to be how to pass the correct code to Spark
>> to connect to HBase:
>>
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
>> failure: Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.hadoop.hbase.protobuf.ProtobufUtil
>> at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convert
>> StringToScan(TableMapReduceUtil.java:521)
>> at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(
>> TableInputFormat.java:110)
>> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRD
>> D.scala:170)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
>> (edited)
>>
>> Now that we have these pluggable DBs did I miss something? This works
>> with master=local but not with remote Spark master
>>
>> I’ve passed in the hbase-client in the --jars part of spark-submit, still
>> fails, what am I missing?
>>
>>
>> From: Pat Ferrel  
>> Reply: Pat Ferrel  
>> Date: May 23, 2018 at 8:57:32 AM
>> To: user@predictionio.apache.org 
>> 
>> Subject:  Spark cluster error
>>
>> Same CLI works using local Spark master, but fails using remote master
>> for a cluster due to a missing class def for protobuf used in hbase. We are
>> using the binary dist 0.12.1.  Is this known? Is there a work around?
>>
>> We are now trying a source build in hope the class will be put in the
>> assembly passed to Spark and the reasoning is that the executors don’t
>> contain hbase classes but when you run a local executor it does, due to
>> some local classpath. If the source built assembly does not have these
>> classes, we will have the same problem. Namely how to get protobuf to the
>> executors.
>>
>> Has anyone seen this?
>>
>>
>


Re: pio app new failed in hbase

2018-05-29 Thread Pat Ferrel
No, this is as expected. When you run pseudo-distributed everything
internally is configured as if the services were on separate machines. See
the clustered instructions here: http://actionml.com/docs/small_ha_cluster. This
sets up 3 machines running different parts; it is not really the best
physical architecture, but it does illustrate how a distributed setup would go.

BTW we (ActionML) use containers now to do this setup but it still works.
The smallest distributed cluster that makes sense for the Universal
Recommender is 5 machines. 2 dedicated to Spark, which can be started and
stopped around the `pio train` process. So 3 are permanent; one for PIO
servers (EventServer and PredictionServer) one for HDFS+HBase, one for
Elasticsearch. This allows you to vertically scale by increasing the size
of the service instances in-place (easy with AWS), then horizontally scale
HBase or Elasticsearch, or Spark independently if vertical scaling is not
sufficient. You can also combine the 2 Spark instances as long as you
remember that the `pio train` process creates a Spark Driver on the machine
the process is launched on and so the driver may need to be nearly as
powerful as a Spark Executor. The Spark Driver is an “invisible" and
therefore often overlooked member of the Spark cluster. It is often, but not
always, smaller than the executors; putting it on the PIO servers machine is
therefore dangerous in terms of scaling unless you know the resources it
will need. Using Yarn can put the Driver on the cluster (off the launching
machine) but it is more complex than the default Spark “standalone” config.

The Universal Recommender is the exception here because it does not require
a big non-local Spark for anything but training, so we move the `pio train`
process to a Spark “Driver” machine that is as ephemeral as the Spark
Executor(s). Other templates may require Spark in both train and deploy.
Once the UR’s training is done it will automatically swap in the new model
so the running deployed PredictionServer will automatically start using
it—no re-deploy needed.
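
To make the training side concrete, a hypothetical launch from the ephemeral
driver machine might look like the sketch below. The host name and memory
sizes are made up, and everything after the lone "--" goes straight through
to spark-submit:

```
# Run from the "driver" machine, not from the PIO servers machine
pio train -- --master spark://spark-master.internal:7077 \
  --driver-memory 8g \
  --executor-memory 12g \
  --total-executor-cores 8
```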


From: Marco Goldin  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 6:38:21 AM
To: user@predictionio.apache.org 

Subject:  Re: pio app new failed in hbase

i was able to solve the issue deleting hbase folder in hdfs with "hdfs dfs
-rm -r /hbase" and restarting hbase.
now app creation in pio is working again.
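
(For reference, the reset was roughly the following. This is only a sketch,
assuming the default pseudo-distributed layout and that losing the existing
HBase event data is acceptable:)

```
stop-hbase.sh              # stop HBase first
hdfs dfs -rm -r /hbase     # remove HBase's root directory in HDFS
start-hbase.sh             # HBase recreates /hbase on startup
pio app new mlolur         # app creation now succeeds
```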

I still wonder why this problem happened though. I'm running HBase in
pseudo-distributed mode (for testing purposes everything, from Spark to
Hadoop, is on a single machine); could that be a problem for PredictionIO in
managing the apps?

2018-05-29 13:47 GMT+02:00 Marco Goldin :

> Hi all, i deleted all old apps from prediction (currently running 0.12.0)
> but when i'm creating a new one i get this error from hbase.
> I inspected hbase from shell but there aren't any table inside.
>
>
> ```
>
> pio app new mlolur
>
> [INFO] [HBLEvents] The table pio_event:events_1 doesn't exist yet.
> Creating now...
>
> Exception in thread "main" org.apache.hadoop.hbase.TableExistsException:
> org.apache.hadoop.hbase.TableExistsException: pio_event:events_1
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> prepareCreate(CreateTableProcedure.java:299)
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> executeFromState(CreateTableProcedure.java:106)
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> executeFromState(CreateTableProcedure.java:58)
>
> at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(
> StateMachineProcedure.java:119)
>
> at org.apache.hadoop.hbase.procedure2.Procedure.
> doExecute(Procedure.java:498)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(
> ProcedureExecutor.java:1147)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> execLoop(ProcedureExecutor.java:942)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> execLoop(ProcedureExecutor.java:895)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> access$400(ProcedureExecutor.java:77)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$
> 2.run(ProcedureExecutor.java:497)
>
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
>
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>
> at org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
>
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:95)
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(
> RpcRetryingCaller.java:209)
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(
> RpcRetryingCaller.java:223)
>
> at 

Re: PIO not using HBase cluster

2018-05-25 Thread Pat Ferrel
How are you starting the EventServer? You should not use pio-start-all,
which assumes all services are local.

- Configure pio-env.sh with your remote HBase.
- Start the EventServer with `pio eventserver &`, or some method that won’t
  kill it when you log off, like `nohup pio eventserver &`.
- This should not start a local HBase, so you should have your remote one
  running already.
- The same goes for the remote Elasticsearch and HDFS: they should be in
  pio-env.sh and already started.
- `pio status` should then be fine with the remote HBase.
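
A minimal sketch of what that looks like on the PIO machine, with placeholder
host names in pio-env.sh:

```
# pio-env.sh already points at the remote services, e.g.:
#   PIO_STORAGE_SOURCES_HBASE_HOSTS=hbase-master.internal
#   PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=es1.internal,es2.internal

nohup pio eventserver &   # starts only the EventServer, nothing else
pio status                # should verify the remote backends
```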


From: Miller, Clifford <clifford.mil...@phoenix-opsgroup.com>
<clifford.mil...@phoenix-opsgroup.com>
Reply: Miller, Clifford <clifford.mil...@phoenix-opsgroup.com>
<clifford.mil...@phoenix-opsgroup.com>
Date: May 25, 2018 at 10:16:01 AM
To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: PIO not using HBase cluster

I'll keep you informed.  However, I'm having issues getting past this.  If
I have hbase installed with the cluster's config files then it still does
not communicate with the cluster.  It does start hbase but on the local PIO
server.  If I ONLY have the hbase config (which worked in version 0.10.0)
then pio-start-all gives the following message.


 pio-start-all
Starting Elasticsearch...
Starting HBase...
/home/centos/PredictionIO-0.12.1/bin/pio-start-all: line 65:
/home/centos/PredictionIO-0.12.1/vendors/hbase/bin/start-hbase.sh: No such
file or directory
Waiting 10 seconds for Storage Repositories to fully initialize...
Starting PredictionIO Event Server...


"pio status" then returns:


 pio status
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.1 is installed at
/home/centos/PredictionIO-0.12.1
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at
/home/centos/PredictionIO-0.12.1/vendors/spark
[INFO] [Management$] Apache Spark 2.1.1 detected (meets minimum requirement
of 1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[WARN] [DomainSocketFactory] The short-circuit local reads feature cannot
be used because libhadoop cannot be loaded.
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[ERROR] [RecoverableZooKeeper] ZooKeeper exists failed after 1 attempts
[ERROR] [ZooKeeperWatcher] hconnection-0x558756be, quorum=localhost:2181,
baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
[WARN] [ZooKeeperRegistry] Can't retrieve clusterId from Zookeeper
[ERROR] [StorageClient] Cannot connect to ZooKeeper (ZooKeeper ensemble:
localhost). Please make sure that the configuration is pointing at the
correct ZooKeeper ensemble. By default, HBase manages its own ZooKeeper, so
if you have not configured HBase to use an external ZooKeeper, that means
your HBase is not started or configured properly.
[ERROR] [Storage$] Error initializing storage client for source HBASE.
org.apache.hadoop.hbase.ZooKeeperConnectionException: Can't connect to
ZooKeeper
at
org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:2358)
at
org.apache.predictionio.data.storage.hbase.StorageClient.(StorageClient.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:252)
at
org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:283)
at
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
at
scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
at
org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:244)
at
org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:315)
at
org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:300)
at
org.apache.predictionio.data.storage.Storage$.getLEvents(Storage.scala:448)
at
org.apache.predictionio.data.storage.Storage$.verifyAllDataObjects(Storage.scala:384)
at
org.apache.predictionio.tools.commands.Management$.status(Management.scala

Re: PIO not using HBase cluster

2018-05-25 Thread Pat Ferrel
No, you need to have HBase installed, or at least its config installed, on
the PIO machine. The servers defined in pio-env.sh will be configured for
cluster operations and will be started separately from PIO. PIO then will not
start HBase; it will only try to communicate with it. But PIO still needs the
config for the client code that is in the pio assembly jar.

Some services were not cleanly separated between client, master, and slave,
so a complete installation is easiest, though you can figure out the minimum
with experimentation; I think it is just the conf directory.
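
A sketch of the minimal setup, assuming an HDP-style layout where the
cluster's client configs live under /etc; adjust the paths to your
distribution:

```
# Copy the cluster's client configs onto the PIO machine
scp -r hbase-master:/etc/hbase/conf  $PIO_HOME/vendors/hbase/
scp -r namenode:/etc/hadoop/conf     $PIO_HOME/vendors/hadoop/

# then point pio-env.sh at them:
#   HBASE_CONF_DIR=$PIO_HOME/vendors/hbase/conf
#   HADOOP_CONF_DIR=$PIO_HOME/vendors/hadoop/conf
```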

BTW we have a similar setup and are having trouble with the Spark training
phase getting a `classDefNotFound: org.apache.hadoop.hbase.ProtobufUtil` so
can you let us know how it goes?



From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 25, 2018 at 9:43:46 AM
To: user@predictionio.apache.org 

Subject:  PIO not using HBase cluster

I'm attempting to use a remote cluster with PIO 0.12.1.  When I run
pio-start-all it starts the hbase locally and does not use the remote
cluster as configured.  I've copied the HBase and Hadoop conf files from
the cluster and put them into the locally configured directories.  I set
this up in the past using a similar configuration but was using PIO
0.10.0.  When doing this with this version I could start pio with only the
hbase and hadoop conf present.  This does not seem to be the case any
longer.

If I only put the cluster configs then it complains that it cannot find
start-hbase.sh.  If I put a hbase installation with cluster configs then it
will start a local hbase and not use the remote cluster.

Below is my PIO configuration



#!/usr/bin/env bash
#
# Safe config that will work if you expand your cluster later
SPARK_HOME=$PIO_HOME/vendors/spark
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch
HADOOP_CONF_DIR=$PIO_HOME/vendors/hadoop/conf
HBASE_CONF_DIR==$PIO_HOME/vendors/hbase/conf


# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

# Need to use HDFS here instead of LOCALFS to enable deploying to
# machines without the local model
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

# What store to use for what data
# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch
# The next line should match the ES cluster.name in ES config
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=dsp_es_cluster

# For clustered Elasticsearch (use one host/port if not clustered)
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=ip-10-0-1-136.us-gov-west-1.compute.internal,ip-10-0-1-126.us-gov-west-1.compute.internal,ip-10-0-1-126.us-gov-west-1.compute.internal
#PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,9300,9300
#PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO 0.12.0+ uses the REST client for ES 5+ and this defaults to
# port 9200, change if appropriate but do not use the Transport Client port
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://ip-10-0-1-138.us-gov-west-1.compute.internal:8020/models

# HBase Source config
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase

# Hbase clustered config (use one host/port if not clustered)
PIO_STORAGE_SOURCES_HBASE_HOSTS=ip-10-0-1-138.us-gov-west-1.compute.internal,ip-10-0-1-209.us-gov-west-1.compute.internal,ip-10-0-1-79.us-gov-west-1.compute.internal


Re: Spark2 with YARN

2018-05-24 Thread Pat Ferrel
I’m having a java.lang.NoClassDefFoundError in a different context and
different class. Have you tried this without Yarn? Sorry I can’t find the
rest of this thread.


From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 24, 2018 at 4:16:58 PM
To: user@predictionio.apache.org 

Subject:  Spark2 with YARN

I've setup a cluster using Hortonworks HDP with Ambari all running in AWS.
I then created a separate EC2 instance and installed PIO 0.12.1, hadoop,
elasticsearch, hbase, and spark2.  I copied the configurations from the HDP
cluster and then pio-start-all.  The pio-start-all completes successfully
and running "pio status" also shows success.  I'm following the "Text
Classification Engine Tutorial".  I've imported the data.  I'm using the
following command to train: "pio train -- --master yarn".  After running
the command I get the following exception.  Does anyone have any ideas of
what I may have missed during my setup?

Thanks in advance.

#
Exception follows:

Exception in thread "main" java.lang.NoClassDefFoundError:
com/sun/jersey/api/client/config/ClientConfig
at
org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
at
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:152)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
at org.apache.spark.SparkContext.(SparkContext.scala:509)
at
org.apache.predictionio.workflow.WorkflowContext$.apply(WorkflowContext.scala:45)
at
org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:59)
at
org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:251)
at
org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more

##


Re: Spark cluster error

2018-05-24 Thread Pat Ferrel
Thanks Donald,

We have:

   - built pio with hbase 1.4.3, which is what we have deployed
   - verified that the `ProtobufUtil` class is in the pio hbase assembly
   - verified the assembly is passed in --jars to spark-submit
   - verified that the executors receive and store the assemblies in the FS
   work dir on the worker machines
   - verified that hashes match the original assembly so the class is being
   received by every executor

However the executor is unable to find the class.

This seems just short of impossible but clearly possible. How can the
executor deserialize the code but not find it later?

Not sure what you mean by the classpath going into the cluster? The classDef
not found does seem to be in the pio 0.12.1 hbase assembly, isn’t this
where it should get it?

Thanks again
p


From: Donald Szeto <don...@apache.org> <don...@apache.org>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 24, 2018 at 2:10:24 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: Spark cluster error

0.12.1 packages HBase 0.98.5-hadoop2 in the storage driver assembly.
Looking at Git history it has not changed in a while.

Do you have the exact classpath that has gone into your Spark cluster?

On Wed, May 23, 2018 at 1:30 PM, Pat Ferrel <p...@actionml.com> wrote:

> A source build did not fix the problem, has anyone run PIO 0.12.1 on a
> Spark cluster? The issue seems to be how to pass the correct code to Spark
> to connect to HBase:
>
> [ERROR] [TransportRequestHandler] Error while invoking
> RpcHandler#receive() for one-way message.
> [ERROR] [TransportRequestHandler] Error while invoking
> RpcHandler#receive() for one-way message.
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
> failure: Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.hadoop.hbase.protobuf.ProtobufUtil
> at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.
> convertStringToScan(TableMapReduceUtil.java:521)
> at org.apache.hadoop.hbase.mapreduce.TableInputFormat.
> setConf(TableInputFormat.java:110)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(
> NewHadoopRDD.scala:170)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
> (edited)
>
> Now that we have these pluggable DBs did I miss something? This works with
> master=local but not with remote Spark master
>
> I’ve passed in the hbase-client in the --jars part of spark-submit, still
> fails, what am I missing?
>
>
> From: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
> Reply: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
> Date: May 23, 2018 at 8:57:32 AM
> To: user@predictionio.apache.org <user@predictionio.apache.org>
> <user@predictionio.apache.org>
> Subject:  Spark cluster error
>
> Same CLI works using local Spark master, but fails using remote master for
> a cluster due to a missing class def for protobuf used in hbase. We are
> using the binary dist 0.12.1.  Is this known? Is there a work around?
>
> We are now trying a source build in hope the class will be put in the
> assembly passed to Spark and the reasoning is that the executors don’t
> contain hbase classes but when you run a local executor it does, due to
> some local classpath. If the source built assembly does not have these
> classes, we will have the same problem. Namely how to get protobuf to the
> executors.
>
> Has anyone seen this?
>
>


Re: Spark cluster error

2018-05-23 Thread Pat Ferrel
A source build did not fix the problem, has anyone run PIO 0.12.1 on a
Spark cluster? The issue seems to be how to pass the correct code to Spark
to connect to HBase:

[ERROR] [TransportRequestHandler] Error while invoking RpcHandler#receive()
for one-way message.
[ERROR] [TransportRequestHandler] Error while invoking RpcHandler#receive()
for one-way message.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure:
Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hbase.protobuf.ProtobufUtil
at
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertStringToScan(TableMapReduceUtil.java:521)
at
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:110)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:170)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
(edited)

Now that we have these pluggable DBs, did I miss something? This works with
master=local but not with a remote Spark master.

I’ve passed in the hbase-client in the --jars part of spark-submit, still
fails, what am I missing?


From: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
Reply: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
Date: May 23, 2018 at 8:57:32 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Spark cluster error

Same CLI works using local Spark master, but fails using remote master for
a cluster due to a missing class def for protobuf used in hbase. We are
using the binary dist 0.12.1.  Is this known? Is there a work around?

We are now trying a source build in hope the class will be put in the
assembly passed to Spark and the reasoning is that the executors don’t
contain hbase classes but when you run a local executor it does, due to
some local classpath. If the source built assembly does not have these
classes, we will have the same problem. Namely how to get protobuf to the
executors.

Has anyone seen this?


Spark cluster error

2018-05-23 Thread Pat Ferrel
Same CLI works using a local Spark master, but fails using a remote master for
a cluster due to a missing class def for protobuf used in HBase. We are
using the binary dist 0.12.1.  Is this known? Is there a workaround?

We are now trying a source build in the hope that the class will be put in the
assembly passed to Spark. The reasoning is that the executors don’t contain
HBase classes, but a local executor does, due to some local classpath. If the
source-built assembly does not have these classes, we will have the same
problem, namely how to get protobuf to the executors.

Has anyone seen this?


RE: Problem with training in yarn cluster

2018-05-23 Thread Pat Ferrel

at 
org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:244)

at 
org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:315)

at 
org.apache.predictionio.data.storage.Storage$.getPDataObject(Storage.scala:364)

at 
org.apache.predictionio.data.storage.Storage$.getPDataObject(Storage.scala:307)

at 
org.apache.predictionio.data.storage.Storage$.getPEvents(Storage.scala:454)

at 
org.apache.predictionio.data.store.PEventStore$.eventsDb$lzycompute(PEventStore.scala:37)

at 
org.apache.predictionio.data.store.PEventStore$.eventsDb(PEventStore.scala:37)

at 
org.apache.predictionio.data.store.PEventStore$.find(PEventStore.scala:73)

at com.actionml.DataSource.readTraining(DataSource.scala:76)

at com.actionml.DataSource.readTraining(DataSource.scala:48)

at 
org.apache.predictionio.controller.PDataSource.readTrainingBase(PDataSource.scala:40)

at org.apache.predictionio.controller.Engine$.train(Engine.scala:642)

at org.apache.predictionio.controller.Engine.train(Engine.scala:176)

at 
org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)

at 
org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:251)

at 
org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)

Caused by: com.google.protobuf.ServiceException:
java.net.UnknownHostException: unknown host: hbase-master

at 
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1678)

at 
org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)

at 
org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:42561)

at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(HConnectionManager.java:1682)

at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(HConnectionManager.java:1591)

at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$StubMaker.makeStub(HConnectionManager.java:1617)

... 36 more

Caused by: java.net.UnknownHostException: unknown host: hbase-master

at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.(RpcClient.java:385)

at 
org.apache.hadoop.hbase.ipc.RpcClient.createConnection(RpcClient.java:351)

at 
org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1530)

at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)

at 
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)

... 41 more







*From: *Ambuj Sharma <am...@getamplify.com>
*Sent: *23 May 2018 08:59
*To: *user@predictionio.apache.org
*Cc: *Wojciech Kowalski <wojci...@tomandco.co.uk>
*Subject: *Re: Problem with training in yarn cluster



Hi Wojciech,

I also faced many problems while setting up yarn with PredictionIO. This may
be a case where yarn is trying to find the pio.log file on the hdfs cluster.
You can try "--master yarn --deploy-mode client"; you need to pass this
configuration with pio train

e.g., pio train -- --master yarn --deploy-mode client








Thanks and Regards

Ambuj Sharma

Sunrise may late, But Morning is sure.

Team ML

Betaout



On Wed, May 23, 2018 at 4:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Actually you might search the archives for “yarn” because I don’t recall
how the setup works off hand.



Archives here:
https://lists.apache.org/list.html?user@predictionio.apache.org



Also check the Spark Yarn requirements and remember that `pio train … --
various Spark params` allows you to pass arbitrary Spark params exactly as
you would to spark-submit on the pio command line. The double dash
separates PIO and Spark params.




From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 22, 2018 at 4:07:38 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, Wojciech Kowalski <wojci...@tomandco.co.uk>
<wojci...@tomandco.co.uk>


Subject:  RE: Problem with training in yarn cluster



What is the command line for `pio train …` Specifically are you using

RE: Problem with training in yarn cluster

2018-05-22 Thread Pat Ferrel
Actually you might search the archives for “yarn” because I don’t recall
how the setup works off hand.

Archives here:
https://lists.apache.org/list.html?user@predictionio.apache.org

Also check the Spark Yarn requirements and remember that `pio train … --
various Spark params` allows you to pass arbitrary Spark params exactly as
you would to spark-submit on the pio command line. The double dash
separates PIO and Spark params.
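
For example, something like this (the resource sizes are placeholders):

```
# Everything after the standalone "--" is handed to spark-submit unchanged
pio train -- --master yarn --deploy-mode client \
  --driver-memory 4g --executor-memory 4g --num-executors 4
```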


From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 22, 2018 at 4:07:38 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, Wojciech Kowalski <wojci...@tomandco.co.uk>
<wojci...@tomandco.co.uk>
Subject:  RE: Problem with training in yarn cluster

What is the command line for `pio train …` Specifically are you using
yarn-cluster mode? This causes the driver code, which is a PIO process, to
be executed on an executor. Special setup is required for this.


From: Wojciech Kowalski <wojci...@tomandco.co.uk> <wojci...@tomandco.co.uk>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 22, 2018 at 2:28:43 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  RE: Problem with training in yarn cluster

Hello,



Actually I have another error in logs that is actually preventing train as
well:



[INFO] [RecommendationEngine$]



   _   _ __  __ _

 /\   | | (_)   |  \/  | |

/  \   ___| |_ _  ___  _ __ | \  / | |

   / /\ \ / __| __| |/ _ \| '_ \| |\/| | |

  /  \ (__| |_| | (_) | | | | |  | | |

 /_/\_\___|\__|_|\___/|_| |_|_|  |_|__|







[INFO] [Engine] Extracting datasource params...

[INFO] [WorkflowUtils$] No 'name' is found. Default empty String will be used.

[INFO] [Engine] Datasource params:
(,DataSourceParams(shop_live,List(purchase, basket-add, wishlist-add,
view),None,None))

[INFO] [Engine] Extracting preparator params...

[INFO] [Engine] Preparator params: (,Empty)

[INFO] [Engine] Extracting serving params...

[INFO] [Engine] Serving params: (,Empty)

[INFO] [log] Logging initialized @6774ms

[INFO] [Server] jetty-9.2.z-SNAPSHOT

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@1798eb08{/jobs,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@47c4c3cd{/jobs/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@3e080dea{/jobs/job,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@c75847b{/jobs/job/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@5ce5ee56{/stages,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@3dde94ac{/stages/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@4347b9a0{/stages/stage,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@63b1bbef{/stages/stage/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@10556e91{/stages/pool,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@5967f3c3{/stages/pool/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@2793dbf6{/storage,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@49936228{/storage/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@7289bc6d{/storage/rdd,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@1496b014{/storage/rdd/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@2de3951b{/environment,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@7f3330ad{/environment/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@40e681f2{/executors,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@61519fea{/executors/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@502b9596{/executors/threadDump,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@367b7166{/executors/threadDump/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@42669f4a{/static,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@2f25f623{/,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@23ae4174{/api,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@4e33e426{/jobs/job/kill,n

RE: Problem with training in yarn cluster

2018-05-22 Thread Pat Ferrel
What is the command line for `pio train …` Specifically are you using 
yarn-cluster mode? This causes the driver code, which is a PIO process, to be 
executed on an executor. Special setup is required for this.


From: Wojciech Kowalski 
Reply: user@predictionio.apache.org 
Date: May 22, 2018 at 2:28:43 PM
To: user@predictionio.apache.org 
Subject:  RE: Problem with training in yarn cluster  

Hello,

 

Actually I have another error in logs that is actually preventing train as well:

 

[INFO] [RecommendationEngine$]  
 
   _   _ __  __ _
 /\   | | (_)   |  \/  | |
    /  \   ___| |_ _  ___  _ __ | \  / | |
   / /\ \ / __| __| |/ _ \| '_ \| |\/| | |
  /  \ (__| |_| | (_) | | | | |  | | |
 /_/    \_\___|\__|_|\___/|_| |_|_|  |_|__|
 
 
   
[INFO] [Engine] Extracting datasource params...
[INFO] [WorkflowUtils$] No 'name' is found. Default empty String will be used.
[INFO] [Engine] Datasource params: (,DataSourceParams(shop_live,List(purchase, 
basket-add, wishlist-add, view),None,None))
[INFO] [Engine] Extracting preparator params...
[INFO] [Engine] Preparator params: (,Empty)
[INFO] [Engine] Extracting serving params...
[INFO] [Engine] Serving params: (,Empty)
[INFO] [log] Logging initialized @6774ms
[INFO] [Server] jetty-9.2.z-SNAPSHOT
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@1798eb08{/jobs,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@47c4c3cd{/jobs/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@3e080dea{/jobs/job,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@c75847b{/jobs/job/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@5ce5ee56{/stages,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@3dde94ac{/stages/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@4347b9a0{/stages/stage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@63b1bbef{/stages/stage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@10556e91{/stages/pool,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@5967f3c3{/stages/pool/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2793dbf6{/storage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@49936228{/storage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@7289bc6d{/storage/rdd,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@1496b014{/storage/rdd/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2de3951b{/environment,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@7f3330ad{/environment/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@40e681f2{/executors,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@61519fea{/executors/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@502b9596{/executors/threadDump,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@367b7166{/executors/threadDump/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@42669f4a{/static,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2f25f623{/,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@23ae4174{/api,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@4e33e426{/jobs/job/kill,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@38d9ae65{/stages/stage/kill,null,AVAILABLE,@Spark}
[INFO] [ServerConnector] Started Spark@17239b3{HTTP/1.1}{0.0.0.0:47948}
[INFO] [Server] Started @7040ms
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@16cffbe4{/metrics/json,null,AVAILABLE,@Spark}
[WARN] [YarnSchedulerBackend$YarnSchedulerEndpoint] Attempted to request 
executors before the AM has registered!
[ERROR] [ApplicationMaster] Uncaught exception:  
 

Thanks,

Wojciech

 

From: Wojciech Kowalski
Sent: 22 May 2018 23:20
To: user@predictionio.apache.org
Subject: Problem with training in yarn cluster

 

Hello, I am trying to set up a distributed cluster with all services separated,
but I have a problem while running train:

 

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /pio/pio.log (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at 

Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-11 Thread Pat Ferrel
BTW The Universal Recommender has it’s own community support group here:
https://groups.google.com/forum/#!forum/actionml-user


From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 11, 2018 at 10:07:25 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, Nasos Papageorgiou
<at.papageorg...@gmail.com> <at.papageorg...@gmail.com>
Subject:  Re: UR: build/train/deploy once & querying for 3 use cases

Yes but do you really care as a business about “users who viewed this also
viewed that”? I’d say no. You want to help them find what to buy and there
is a big difference between viewing and buying behavior. If you are only
interested in increasing time on site, or have ads shown that benefit from
more views then it might make more sense but a pure e-comm site would be
after sales.

The algorithm inside the UR can do all of these but only 1 and 2 are
possible with the current implementation. The algorithm is called Correlated
Cross Occurrence and it can be targeted to recommend any recorded behavior.
On the theory that you would never want to throw away correlated behavior
in building models all behavior is taken into account so #1 could be
restated more precisely (but somewhat redundantly) as “people who viewed
(but then bought) this also viewed (and bought) these”. This targets what
you show people to “important” views. In fact if you are also using search
behavior and brand preferences it gets more wordy, “people who viewed this
(and bought, searched for, and preferred brands in a similar way) also
viewed” So you are showing viewed things that share the type of user like
the viewing user. You can just use one type of behavior, but why? Using all
makes the views more targeted.

So though it is possible to do 1-3 exactly as stated, you will get better
sales with the way described above.

Using my suggested method above #1 and #3 are the same.

   1. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]
   2. "eventNames”: [ “buy”,“view”, “search”, “brand-pref”]
   3. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]

If you want to do exactly as you have shown you’d have to throw out all
correlated cross-behavior.

   1. "eventNames”: [“view”]
   2. "eventNames”: [“buy”]
   3. "eventNames”: [“buy”, “view”] but then the internal model query would
   be only the current user’s view history. This is not supported in this
   exact form but could be added.

As you can see you are discarding a lot of valuable data if you insist on a
very pure interpretation of your 1-3 definitions, and I can promise you
that most knowledgeable e-com sites do not mince words too finely.


From: Nasos Papageorgiou <at.papageorg...@gmail.com>
<at.papageorg...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 11, 2018 at 12:39:27 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: UR: build/train/deploy once & querying for 3 use cases

Just a correction:  File on the first bullet is engine.json (not
events.json).

2018-05-10 17:01 GMT+03:00 Nasos Papageorgiou <at.papageorg...@gmail.com>:

>
>
> Hi all,
> to elaborate on these cases, the purpose is to create a UR for the cases
> of:
>
> 1.   “User who Viewed this item also Viewed”
>
> 2.   “User who Bought this item also Bought”
>
> 3.   “User who Viewed this item also Bought ”
>
> while having Events of Buying and Viewing a product.
> I would like to make some questions:
>
> 1.   On Data source Parameters, file: events.json: There is no matter
> on the sequence of the events which are defined. Right?
>
> 2.   If I specify one Event Type on the “eventNames” in Algorithm
> section (i.e. “view”)  and no event on the “blacklistEvents”,  is the
> second Event Type (i.e. “buy”) specified on the recommended list?
>
> 3.   If I use only "user" on the query, the "item case" will not be
> used for the recommendations. What is happening with the new users in
> that case?   Shall I use both "user" and "item" instead?
>
> 4.Values of less than 1 in “UserBias” and “ItemBias” on the query
> do not have any effect on the result.
>
> 5.    Is it feasible to build/train/deploy only once, and query for
> all 3 use cases?
>
>
> 6.   How to make queries towards the different Apps because there is
> no any obvious way in the query parameters or the URL?
>
> Thank you.
>
>
>
> *From:* Pat Ferrel [mailto:p...@occamsmachete.com]
> *Sent:* Wednesday, May 09, 2018 4:41 PM
> *To:* us

Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-11 Thread Pat Ferrel
Yes but do you really care as a business about “users who viewed this also
viewed that”? I’d say no. You want to help them find what to buy and there
is a big difference between viewing and buying behavior. If you are only
interested in increasing time on site, or have ads shown that benefit from
more views then it might make more sense but a pure e-comm site would be
after sales.

The algorithm inside the UR can do all of these but only 1 and 2 are
possible with the current implementation. The algorithm is called Correlated
Cross Occurrence and it can be targeted to recommend any recorded behavior.
On the theory that you would never want to throw away correlated behavior
in building models all behavior is taken into account so #1 could be
restated more precisely (but somewhat redundantly) as “people who viewed
(but then bought) this also viewed (and bought) these”. This targets what
you show people to “important” views. In fact if you are also using search
behavior and brand preferences it gets more wordy, “people who viewed this
(and bought, searched for, and preferred brands in a similar way) also
viewed” So you are showing viewed things that share the type of user like
the viewing user. You can just use one type of behavior, but why? Using all
makes the views more targeted.

So though it is possible to do 1-3 exactly as stated, you will get better
sales with the way described above.

Using my suggested method above #1 and #3 are the same.

   1. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]
   2. "eventNames”: [ “buy”,“view”, “search”, “brand-pref”]
   3. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]

If you want to do exactly as you have shown you’d have to throw out all
correlated cross-behavior.

   1. "eventNames”: [“view”]
   2. "eventNames”: [“buy”]
   3. "eventNames”: [“buy”, “view”] but then the internal model query would
   be only the current user’s view history. This is not supported in this
   exact form but could be added.

As you can see you are discarding a lot of valuable data if you insist on a
very pure interpretation of your 1-3 definitions, and I can promise you
that most knowledgeable e-com sites do not mince words too finely.
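
On the build/train/deploy-once question from the original thread: one deployed
model can serve the different cases by varying the query. A sketch against a
locally deployed engine, with placeholder ids and the default port:

```
# Personalized recs based on the user's own history
curl -s -H "Content-Type: application/json" \
  -d '{"user": "u-123", "num": 10}' http://localhost:8000/queries.json

# Item-based "people who ... this item also ..." recs
curl -s -H "Content-Type: application/json" \
  -d '{"item": "sku-42", "num": 10}' http://localhost:8000/queries.json
```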


From: Nasos Papageorgiou <at.papageorg...@gmail.com>
<at.papageorg...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 11, 2018 at 12:39:27 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: UR: build/train/deploy once & querying for 3 use cases

Just a correction:  File on the first bullet is engine.json (not
events.json).

2018-05-10 17:01 GMT+03:00 Nasos Papageorgiou <at.papageorg...@gmail.com>:

>
>
> Hi all,
> to elaborate on these cases, the purpose is to create a UR for the cases
> of:
>
> 1.   “User who Viewed this item also Viewed”
>
> 2.   “User who Bought this item also Bought”
>
> 3.   “User who Viewed this item also Bought ”
>
> while having Events of Buying and Viewing a product.
> I would like to make some questions:
>
> 1.   On Data source Parameters, file: events.json: There is no matter
> on the sequence of the events which are defined. Right?
>
> 2.   If I specify one Event Type on the “eventNames” in Algorithm
> section (i.e. “view”)  and no event on the “blacklistEvents”,  is the
> second Event Type (i.e. “buy”) specified on the recommended list?
>
> 3.   If I use only "user" on the query, the "item case" will not be
> used for the recommendations. What is happening with the new users in
> that case?   Shall I use both "user" and "item" instead?
>
> 4.Values of less than 1 in “UserBias” and “ItemBias” on the query
> do not have any effect on the result.
>
> 5.Is it feasible to build/train/deploy only once, and query for
> all 3 use cases?
>
>
> 6.   How to make queries towards the different Apps because there is
> no any obvious way in the query parameters or the URL?
>
> Thank you.
>
>
>
> *From:* Pat Ferrel [mailto:p...@occamsmachete.com]
> *Sent:* Wednesday, May 09, 2018 4:41 PM
> *To:* user@predictionio.apache.org; gerasimos xydas
> *Subject:* Re: UR: build/train/deploy once & querying for 3 use cases
>
>
>
> Why do you want to throw away user behavior in making recommendations? The
> lift you get in purchases will be less.
>
>
>
> There is a use case for this when you are making recommendations basically
> inside a session where the user is browsing/viewing things on a hunt for
> something. In this case you would want to make recs using the user history
> of views but you have to build a model of purchase as the primary indicator
> or you won’t get purchase r

Re: UR evaluation

2018-05-10 Thread Pat Ferrel
Exactly, ranking is the only task of a recommender. Precision is not
automatically good at that but something like MAP@k is.


From: Marco Goldin <markomar...@gmail.com> <markomar...@gmail.com>
Date: May 10, 2018 at 10:09:22 PM
To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: UR evaluation

Very nice article. It makes much clearer the importance of treating
recommendation as a ranking task.
Thanks

Il gio 10 mag 2018, 19:12 Pat Ferrel <p...@occamsmachete.com> ha scritto:

> Here is a discussion of how we used it for tuning with multiple input
> types:
> https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>
> We used video likes, dislikes, and video metadata to increase our MAP@k
> by 26% eventually. So this was mainly an exercise in incorporating data.
> Since this research was done we have learned how to better tune this type
> of situation but that’s a long story fit for another blog post.
>
>
> From: Marco Goldin <markomar...@gmail.com> <markomar...@gmail.com>
> Reply: user@predictionio.apache.org <user@predictionio.apache.org>
> <user@predictionio.apache.org>
> Date: May 10, 2018 at 9:54:23 AM
> To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
> Cc: user@predictionio.apache.org <user@predictionio.apache.org>
> <user@predictionio.apache.org>
> Subject:  Re: UR evaluation
>
> thank you very much, i didn't see this tool, i'll definitely try it.
> Clearly better to have such a specific instrument.
>
>
>
> 2018-05-10 18:36 GMT+02:00 Pat Ferrel <p...@occamsmachete.com>:
>
>> You can if you want but we have external tools for the UR that are much
>> more flexible. The UR has tuning that can’t really be covered by the built
>> in API. https://github.com/actionml/ur-analysis-tools They do MAP@k as
>> well as creating a bunch of other metrics and comparing different types of
>> input data. They use a running UR to make queries against.
>>
>>
>> From: Marco Goldin <markomar...@gmail.com> <markomar...@gmail.com>
>> Reply: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Date: May 10, 2018 at 7:52:39 AM
>> To: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Subject:  UR evaluation
>>
>> hi all, i successfully trained a universal recommender but i don't know
>> how to evaluate the model.
>>
>> Is there a recommended way to do that?
>> I saw that *predictionio-template-recommender* actually has
>> the Evaluation.scala file which uses the class *PrecisionAtK *for the
>> metrics.
>> Should i use this template to implement a similar evaluation for the UR?
>>
>> thanks,
>> Marco Goldin
>> Horizons Unlimited s.r.l.
>>
>>
>


Re: UR evaluation

2018-05-10 Thread Pat Ferrel
Here is a discussion of how we used it for tuning with multiple input types: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

We used video likes, dislikes, and video metadata to increase our MAP@k by 26% 
eventually. So this was mainly an exercise in incorporating data. Since this 
research was done we have learned how to better tune this type of situation but 
that’s a long story fit for another blog post.


From: Marco Goldin <markomar...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: May 10, 2018 at 9:54:23 AM
To: Pat Ferrel <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  Re: UR evaluation  

thank you very much, i didn't see this tool, i'll definitely try it. Clearly 
better to have such a specific instrument.



2018-05-10 18:36 GMT+02:00 Pat Ferrel <p...@occamsmachete.com>:
You can if you want but we have external tools for the UR that are much more 
flexible. The UR has tuning that can’t really be covered by the built in API. 
https://github.com/actionml/ur-analysis-tools They do MAP@k as well as creating 
a bunch of other metrics and comparing different types of input data. They use 
a running UR to make queries against.


From: Marco Goldin <markomar...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: May 10, 2018 at 7:52:39 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  UR evaluation

hi all, i successfully trained a universal recommender but i don't know how to 
evaluate the model.

Is there a recommended way to do that?
I saw that predictionio-template-recommender actually has the Evaluation.scala 
file which uses the class PrecisionAtK for the metrics. 
Should i use this template to implement a similar evaluation for the UR?

thanks,
Marco Goldin
Horizons Unlimited s.r.l.




Re: UR evaluation

2018-05-10 Thread Pat Ferrel
You can if you want but we have external tools for the UR that are much
more flexible. The UR has tuning that can’t really be covered by the built
in API. https://github.com/actionml/ur-analysis-tools They do MAP@k as well
as creating a bunch of other metrics and comparing different types of input
data. They use a running UR to make queries against.


From: Marco Goldin  
Reply: user@predictionio.apache.org 

Date: May 10, 2018 at 7:52:39 AM
To: user@predictionio.apache.org 

Subject:  UR evaluation

hi all, i successfully trained a universal recommender but i don't know how
to evaluate the model.

Is there a recommended way to do that?
I saw that *predictionio-template-recommender* actually has
the Evaluation.scala file which uses the class *PrecisionAtK *for the
metrics.
Should i use this template to implement a similar evaluation for the UR?

thanks,
Marco Goldin
Horizons Unlimited s.r.l.


Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-09 Thread Pat Ferrel
Why do you want to throw away user behavior in making recommendations? The
lift you get in purchases will be less.

There is a use case for this when you are making recommendations inside a
session, where the user is browsing/viewing things on a hunt for something.
In that case you would want to make recs using the user's history of views,
but you still have to build the model with purchase as the primary indicator
or you won't get purchase recommendations, and believe me, recommending views
is a road to bad results. People view many things they do not buy, which is
why the model keeps only the view behavior that correlates with purchases. So
create a model with purchase as the primary indicator and view as the
secondary.
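
Roughly, that model side is just the order of indicators in engine.json,
something like the sketch below, with the primary indicator listed first
(event names are illustrative, and depending on the UR version this is
either "eventNames" or the newer "indicators" list):

  "algorithms": [{
    "name": "ur",
    "params": {
      "eventNames": ["purchase", "view"]
    }
  }]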

Once you have the model, use only the user's session viewing history as the
Elasticsearch query.

This is a feature on our list.


From: gerasimos xydas 

Reply: user@predictionio.apache.org 

Date: May 9, 2018 at 6:20:46 AM
To: user@predictionio.apache.org 

Subject:  UR: build/train/deploy once & querying for 3 use cases

Hello everybody,

We are experimenting with the Universal Recommender to provide
recommendations for the 3 distinct use cases below:

- Get a product recommendation based on product views
- Get a product recommendation based on product purchases
- Get a product recommendation based on previous purchases and views (i.e.
users who viewed this bought that)

The event server is fed from a single app with two types of events: "view"
and "purchase".

1. How should we customize the query to fetch results for each separate
case?
2. Is it feasible to build/train/deploy only once, and query for all 3 use
cases?


Best Regards,
Gerasimos


Users of Scala 2.11

2018-04-24 Thread Pat Ferrel
Hi all,

Mahout has hit a bit of a bump in releasing a Scala 2.11 version. I was
able to build 0.13.0 for Scala 2.11 and have published it on github as a
Maven compatible repo. I’m also using it from SBT.

If anyone wants access let me know.



Re: Info / resources for scaling PIO?

2018-04-24 Thread Pat Ferrel
PIO is based on the architecture of Spark, which uses HDFS. HBase also uses
HDFS. Scaling these is quite well documented on the web. Scaling PIO is
the same as scaling all of its services. It is unlikely you'll need it, but
you can also have more than one PIO server behind a load balancer.

Don’t use local models, put them in HDFS. Don’t mess with NFS, it is not
the design point for PIO. Scaling Spark beyond one machine will require
HDFS anyway so use it.
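
For reference, pointing the model repository at HDFS is just a couple of
pio-env.sh settings, roughly like this sketch (the /models path is
illustrative):

  PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
  PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

  PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
  PIO_STORAGE_SOURCES_HDFS_PATH=/models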

I also advise against using ES for all storage. Four things hit the event
storage: incoming events (input); training, where all events are read out
at high speed; optionally model storage (depending on the engine); and
queries. This will quickly overload one service, and ES is not built as an
object retrieval DB. The only reason to use ES for all storage is that it is
convenient when doing development or experimenting with engines. In
production it would be risky to rely on ES for all storage and you would
still need to scale out Spark and therefore HDFS.

There is a little written about various scaling models here:
http://actionml.com/docs/pio_by_actionml under the architecture and workflow
tab, and there are a couple of system install docs that cover scaling.


From: Adam Drew  
Reply: user@predictionio.apache.org 

Date: April 24, 2018 at 7:37:35 AM
To: user@predictionio.apache.org 

Subject:  Info / resources for scaling PIO?

Hi all!



Is there any info on how to scale PIO to multiple nodes? I’ve gone through
a lot of the docs on the site and haven’t found anything. I’ve tested PIO
running with HBase and ES for metadata and events, and with using just ES
for both (my preference thus far) and have my models on local storage. Would
scaling simply be a matter of deploying clustered ES, and then finding some
way to share my model storage, such as NFS or HDFS? The question then is
what (if anything) has to be done for the nodes to “know” about changes on
other nodes. For example, if the model gets trained on node A does node B
automatically know about that?



I hope that makes sense. I’m coming to PIO with no prior experience for the
underlying apache bits (spark, hbase / hdfs, etc) so there’s likely things
I’m not considering. Any help / docs / guidance is appreciated.



Thanks!

Adam


Re: pio deploy without spark context

2018-04-14 Thread Pat Ferrel
The need for Spark at query time depends on the engine. Which are you
using? The Universal Recommender, which I maintain, does not require Spark
for queries but uses PIO. We simply don’t use the Spark context so it is
ignored. To make PIO work you need to have the Spark code accessible, but
that doesn't mean there must be a Spark cluster; you can set the Spark
master to "local" and no Spark resources are used in the deployed pio
PredictionServer.
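
For example, something like this on the PredictionServer should be enough
(a sketch; the port flag is optional and 8000 is the default):

  pio deploy --port 8000 -- --master local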

We have infra code to spin up a Spark cluster for training and bring it
back down afterward. This all works just fine. The UR PredictionServer also
has no need to be re-deployed since the model is hot-swapped after
training. Deploy once, run forever. And there is no real requirement for
Spark to do queries.

So depending on the Engine the requirement for Spark is code level not
system level.


From: Donald Szeto  
Reply: user@predictionio.apache.org 

Date: April 13, 2018 at 4:48:15 PM
To: user@predictionio.apache.org 

Subject:  Re: pio deploy without spark context

Hi George,

This is unfortunately not possible now without modifying the source code,
but we are planning to refactor PredictionIO to be runtime-agnostic,
meaning the engine server would be independent and SparkContext would not
be created if not necessary.

We will start a discussion on the refactoring soon. You are very welcome to
add your input then, and any subsequent contribution would be highly
appreciated.

Regards,
Donald

On Fri, Apr 13, 2018 at 3:51 PM George Yarish 
wrote:

> Hi all,
>
> We use pio engine which doesn't require apache spark in serving time, but
> from my understanding anyway sparkContext will be created by "pio deploy"
> process by default.
> My question is there any way to deploy an engine avoiding creation of
> spark application if I don't need it?
>
> Thanks,
> George
>
>


Re: Hbase issue

2018-04-13 Thread Pat Ferrel
This may seem unhelpful now but for others it might be useful to mention some 
minimum PIO in production best practices:

1) PIO should IMO never be run in production on a single node. When all 
services share the same memory, CPU, and disk, it is very difficult to find the 
root cause of a problem.
2) back up data with pio export periodically (see the example below)
3) install monitoring for disk used, as well as response times and other 
factors so you get warnings before you get wedged.
4) PIO will store data forever. It is designed as an input only system. Nothing 
is dropped ever. This is clearly unworkable in real life so a feature was added 
to trim the event stream in a safe way in PIO 0.12.0. There is a separate 
Template for trimming the DB and doing other things like deduplication and 
other compression on some schedule that can and should be different than 
training. Do not use this template until you upgrade and make sure it is 
compatible with your template: https://github.com/actionml/db-cleaner
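
Re item 2, the backup is just the pio CLI run on a schedule, roughly like this 
sketch (the app id and output path are illustrative):

  pio export --appid 1 --output /backups/pio-events-$(date +%F)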


From: bala vivek 
Reply: user@predictionio.apache.org 
Date: April 13, 2018 at 2:50:26 AM
To: user@predictionio.apache.org 
Subject:  Re: Hbase issue  

Hi Donald,

Yes, I'm running on a single machine. PIO, HBase, Elasticsearch, and Spark all 
run on the same server. Let me know which files I need to remove, because I 
have client data present in PIO. 

I have tried adding the entries to hbase-site.xml from the following link, 
after which HMaster seems active, but the error still remains the same.

https://medium.com/@tjosepraveen/cant-get-connection-to-zookeeper-keepererrorcode-connectionloss-for-hbase-63746fbcdbe7


Hbase Error logs :- ( I have commented the server name)

2018-04-13 04:31:28,246 INFO  [RS:0;VD500042:49584-SendThread(localhost:2182)] 
zookeeper.ClientCnxn: Opening socket connection to server 
localhost/0:0:0:0:0:0:0:1:2182. Will not attempt to authenticate using SASL 
(unknown error)
2018-04-13 04:31:28,247 WARN  [RS:0;XX:49584-SendThread(localhost:2182)] 
zookeeper.ClientCnxn: Session 0x162be5554b90003 for server null, unexpected 
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2018-04-13 04:31:28,553 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Master not initialized after 20ms seconds
        at 
org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:225)
        at 
org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:449)
        at 
org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:225)
        at 
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2436)
(END)

I have tried pio-stop-all and pio-start-all multiple times, but no luck; the 
service is not up. If I install HBase alone in the existing setup, let me know 
what things I should consider. If anyone has faced this issue, please provide 
the solution steps.

On Thu, Apr 12, 2018 at 9:13 PM, Donald Szeto  wrote:
Hi Bala,

Are you running a single-machine HBase setup? The ZooKeeper embedded in such a 
setup is pretty fragile to disk space issue and your ZNode might have corrupted.

If that’s indeed your setup, please take a look at HBase log files, 
specifically on messages from ZooKeeper. In this situation, one way to recover 
is to remove ZooKeeper files and let HBase recreate them, assuming from your 
log output that you don’t have other services depend on the same ZK.

Regards,
Donald

On Thu, Apr 12, 2018 at 5:34 AM bala vivek  wrote:
Hi,

I use PIO version 0.10.0 and HBase 1.2.4. The setup was working fine until this 
morning. I saw PIO was down because of a disk space issue on the server's mount, 
and I cleared the unwanted files.

After doing a pio-stop-all and pio-start-all, the HMaster service is not 
working. I tried the pio restart multiple times.

I can see that whenever I do a pio-stop-all and check the services using jps, 
HMaster seems to be running. Similarly, I tried running the ./start-hbase.sh 
script, but pio status still does not show success.

pio error log :

[INFO] [Console$] Inspecting PredictionIO...
[INFO] [Console$] PredictionIO 0.10.0-incubating is installed at 
/opt/tools/PredictionIO-0.10.0-incubating
[INFO] [Console$] 

Re: how to set engine-variant in intellij idea

2018-04-10 Thread Pat Ferrel
There are instructions for using IntelliJ but, though I wrote the last
version, I apologize that I can't make them work anymore. If you get them to
work you would be doing the community a great service by telling us how, or
by editing the instructions.

http://predictionio.apache.org/resources/intellij/


From: qi zhang  
Reply: user@predictionio.apache.org 

Date: April 10, 2018 at 1:40:58 AM
To: user@predictionio.apache.org 

Subject:  how to set engine-variant in intellij idea

Hello everyone:
I ran into the following problem when deploying a model with IntelliJ IDEA.

What is engine-variant? Where can I get the value of this parameter, and could
you help with an example showing how to set it?
Thank you! Many thanks!


(two inline image attachments)


Re: Unclear problem with using S3 as a storage data source

2018-03-29 Thread Pat Ferrel
Ok, the problem, as I thought at first, is that Spark creates the model and the 
PredictionServer must read it.

My methods below still work. There is very little extra cost to creating a 
pseudo-cluster for HDFS as far as performance goes if it is still running all 
on one machine.

You can also write it to the local fs on the Spark/training machine and copy it 
to the PredictionServer before deploy. A simple scp in a script would do that.
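
Something like this in the training script would be enough (paths and the 
hostname are illustrative):

  scp -r ~/.pio_store/models/ prediction-server:~/.pio_store/models/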

Again I have no knowledge of using S3 for such things. If that works, someone 
else will have to help.




From: Dave Novelli <d...@ultravioletanalytics.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: March 29, 2018 at 6:19:58 AM
To: Pat Ferrel <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  Re: Unclear problem with using S3 as a storage data source  

Sorry Pat, I think I took some shortcuts in my initial explanation that are 
causing some confusion :) I'll try laying everything out again in detail...

I have configured 2 servers in AWS:

Event/Prediction Server - t2.medium
- Runs permanently
- Using swap to deal with 4GB mem limit (I know, I know)
- ElasticSearch
- HBase (pseudo-distributed mode, using normal files instead of hdfs)
- Web server for events and 6 prediction models

Training Server - r4.large
- Only spun up to execute "pio train" for the 6 UR models I've configured then 
spun back down
- Spark

My specific problem is that running "pio train" on the training server when 
"LOCALFS" is set as the model data store will deposit all the stub files in 
.pio_store/models/.

When I run "pio deploy" on the Event/Prediction Server, it's looking for those 
files in the .pio_store/models/ directory on the Event/Prediction server, and 
they're obviously not there. If I manually copy the files from the Training 
server to the Event/Prediction server then "pio deploy" works as expected.

My thought is that if the Training server saves those model stub files to S3, 
then the Event/Prediction server can read those files from S3 and I won't have 
to manually copy them.


Hopefully this clears my situation up!


As a note - I realize t2.medium is not a feasible instance type for any 
significant production system, but I'm bootstrapping a demo system on a very 
tight budget for a site that will almost certainly have extremely low traffic. 
In my initial tests I've managed to get UR working on this configuration and 
will be doing some simple load testing soon to see how far I can push it before 
it crashes. Speed is obviously not an issue at the moment but once it is (and 
once there's some funding) that t2 will be replaced with an r4 or an m5

Cheers,
Dave


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Wed, Mar 28, 2018 at 7:40 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
Sorry then I don’t understand what part has no access to the file system on the 
single machine? 

Also, a t2 is not going to work with PIO. Spark 2 alone requires something like 
2g for a do-nothing empty executor and driver, so a real app will require 16g 
or so minimum (my laptop has 16g). Running the OS, HBase, ES, and Spark will get 
you to over 8g, then add data. Spark keeps all data needed at a given phase of 
the calculation in memory across the cluster; that's where it gets its speed. 
Welcome to big data :-)


From: Dave Novelli <d...@ultravioletanalytics.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: March 28, 2018 at 3:47:35 PM
To: Pat Ferrel <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  Re: Unclear problem with using S3 as a storage data source

I don't *think* I need more spark nodes - I'm just using the one for training 
on an r4.large instance I spin up and down as needed.

I was hoping to avoid adding any additional computational load to my 
Event/Prediction/HBase/ES server (all running on a t2.medium) so I am looking 
for a way to *not* install HDFS on there as well. S3 seemed like it would be a 
super convenient way to pass the model files back and forth, but it sounds like 
it wasn't implemented as a data source for the model repository for UR.

Perhaps that's something I could implement and contribute? I can *kinda* read 
Scala haha, maybe this would be a fun learning project. Do you think it would 
be fairly straightforward?


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Wed, Mar 28, 2018 at 6:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
So you need to have more Spark nodes and this is the problem?

If so setup HBase on pseudo-clustered HDFS so you have a master node address 
even though all storage is on one machine. Then you us

Re: Unclear problem with using S3 as a storage data source

2018-03-28 Thread Pat Ferrel
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node
address even though all storage is on one machine. Then you use that
version of HDFS to tell Spark where to look for the model. It gives the
model a URI.

I have never used the raw S3 support. HDFS can also be backed by S3, but you
use the HDFS APIs; it is an HDFS config setting to use S3.
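
If you go that route it is roughly the Hadoop S3A connector: hadoop-aws jars
on the classpath, credentials in core-site.xml, and s3a:// URIs for paths.
The property names, for reference (values elided, and this is from memory,
not tested here):

  fs.s3a.access.key
  fs.s3a.secret.key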

It is a rather unfortunate side effect of PIO but there are 2 ways to solve
this with no extra servers.

Maybe someone else knows how to use S3 natively for the model stub?


From: Dave Novelli <d...@ultravioletanalytics.com>
<d...@ultravioletanalytics.com>
Date: March 28, 2018 at 12:13:12 PM
To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: Unclear problem with using S3 as a storage data source

Well, it looks like the local file system isn't an option in a multi-server
configuration without manually setting up a process to transfer those stub
model files.

I trained models on one heavy-weight temporary instance, and then when I
went to deploy from the prediction server instance it failed due to missing
files. I copied the .pio_store/models directory from the training server
over to the prediction server and then was able to deploy.

So, in a dual-instance configuration what's the best way to store the
files? I'm using pseudo-distributed HBase with standard file system storage
instead of HDFS (my current aim is keeping down cost and complexity for a
pilot project).

Is S3 back on the table as on option?

On Fri, Mar 23, 2018 at 11:03 AM, Dave Novelli <
d...@ultravioletanalytics.com> wrote:

> Ahhh ok, thanks Pat!
>
>
> Dave Novelli
> Founder/Principal Consultant, Ultraviolet Analytics
> www.ultravioletanalytics.com | 919.210.0948 <(919)%20210-0948> |
> d...@ultravioletanalytics.com
>
> On Fri, Mar 23, 2018 at 8:08 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> There is no need to have Universal Recommender models put in S3, they are
>> not used and only exist (in stub form) because PIO requires them. The
>> actual model lives in Elasticsearch and uses special features of ES to
>> perform the last phase of the algorithm and so cannot be replaced.
>>
>> The stub PIO models have no data and will be tiny. putting them in HDFS
>> or the local file system is recommended.
>>
>>
>> From: Dave Novelli <d...@ultravioletanalytics.com>
>> <d...@ultravioletanalytics.com>
>> Reply: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Date: March 22, 2018 at 6:17:32 PM
>> To: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Subject:  Unclear problem with using S3 as a storage data source
>>
>> Hi all,
>>
>> I'm using the Universal Recommender template and I'm trying to switch
>> storage data sources from local file to S3 for the model repository. I've
>> read the page at https://predictionio.apache.org/system/anotherdatastore/
>> to try to understand the configuration requirements, but when I run pio
>> train it's indicating an error and nothing shows up in the s3 bucket:
>>
>> [ERROR] [S3Models] Failed to insert a model to
>> s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d
>>
>> I created a new bucket named "pio-model" and granted full public
>> permissions.
>>
>> Seemingly relevant settings from pio-env.sh:
>>
>> PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
>> PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
>> ...
>>
>> PIO_STORAGE_SOURCES_S3_TYPE=s3
>> PIO_STORAGE_SOURCES_S3_REGION=us-west-2
>> PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model
>>
>> # I've tried with and without this
>> #PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com
>>
>> # I've tried with and without this
>> #PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model
>>
>>
>> Any suggestions where I can start troubleshooting my configuration?
>>
>> Thanks,
>> Dave
>>
>>
>


--
Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com


Re: Error when training The Universal Recommender 0.7.0 with PredictionIO 0.12.0-incubating

2018-03-27 Thread Pat Ferrel
Pio train requires that the ES hosts are known to Spark, which writes the
model to ES. You can pass these in on the `pio train` command line:

pio train … -- --conf spark.es.nodes=“node1,node2,node3”

notice no spaces in the quoted list of hosts, also notice the double dash,
which separates pio parameters from Spark parameters.

There is a way to pass this in using the sparkConf section in engine.json
but this is unreliable due to how the commas are treated in ES. The site
description for the UR in the small HA cluster has not been updated for
0.7.0 because we are expecting a Mahout release, which will greatly simplify
the build process described in the README.


From: VI, Tran Tan Phong  
Reply: user@predictionio.apache.org 

Date: March 27, 2018 at 3:09:30 AM
To: user@predictionio.apache.org 

Subject:  Error when training The Universal Recommender 0.7.0 with
PredictionIO 0.12.0-incubating

Hi,



I am trying to build and train UR 0.7.0 with PredictionIO 0.12.0-incubating
on a local “Small HA Cluster” (http://actionml.com/docs/small_ha_cluster)
using Elasticsearch 5.5.2.

By following the different steps of the how-to, I succeeded in executing the
"pio build" command of UR 0.7.0. But I am getting some errors on the following
step, "pio train".



Here are the principal errors:

…

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[ERROR] [NetworkClient] Node [127.0.0.1:9200] failed (Connection refused
(Connection refused)); no other nodes left - aborting...

…



Exception in thread "main"
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES
version - typically this happens if the network/Elasticsearch cluster is
not accessible or when targeting a WAN/Cloud instance without the proper
setting 'es.nodes.wan.only'

…

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[127.0.0.1:9200]]



The cluster Elasticsearch (aml-elasticsearch) is up, but is not listening
on localhost.



Here under is my config of ES 5.5.2

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=aml-elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=aml-master,aml-slave-1,aml-slave-2

PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200

PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch



Did somebody get this kind of error before? Any help or suggestion would be
appreciated.



Thanks,

VI Tran Tan Phong
This message contains information that may be privileged or confidential
and is the property of the Capgemini Group. It is intended only for the
person to whom it is addressed. If you are not the intended recipient, you
are not authorized to read, print, retain, copy, disseminate, distribute,
or use this message or any part thereof. If you receive this message in
error, please notify the sender immediately and delete all copies of this
message.


Re: UR 0.7.0 - problem with training

2018-03-08 Thread Pat Ferrel
BTW I think you may have to pass the setting on the cli by adding "spark."
to the beginning of the key name:

*pio train -- --conf spark.es.nodes=**“**localhost" --driver-memory 8g
--executor-memory 8g*


From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: March 8, 2018 at 11:04:55 AM
To: Wojciech Kowalski <wojci...@tomandco.co.uk> <wojci...@tomandco.co.uk>,
user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, u...@predictionio.incubator.apache.org
<u...@predictionio.incubator.apache.org>
<u...@predictionio.incubator.apache.org>
Subject:  Re: UR 0.7.0 - problem with training

es.nodes is supposed to be a string with hostnames separated by commas.
Depending on how your containers are set to communicate with the outside
world (Docker networking or port mapping) you may also need to set the
port, which is 9200 by default.

If your container is using port mapping and maps the container port 9200 to
the localhost port of 9200 you should be ok with only setting hostnames in
engine.json.

es.nodes=“localhost”

But I suspect you didn’t set your container communication strategy because
this is the fallback that would have been tried with no valid setting.

If this is true look up how you set Docker to communicate, port mapping is
the simplest for a single all-in-one machine.


From: Wojciech Kowalski <wojci...@tomandco.co.uk> <wojci...@tomandco.co.uk>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: March 8, 2018 at 7:31:10 AM
To: u...@predictionio.incubator.apache.org
<u...@predictionio.incubator.apache.org>
<u...@predictionio.incubator.apache.org>
Subject:  UR 0.7.0 - problem with training

Hello, I am trying to set up the new UR 0.7.0 with PredictionIO 0.12.0 but all
attempts are failing.



I cannot set the ES config in the engine's sparkConf section, as I get this
error:

org.elasticsearch.index.mapper.MapperParsingException: object mapping for
[sparkConf.es.nodes] tried to parse field [es.nodes] as object, but found a
concrete value



If I don't set this up, the engine fails to train because it cannot find
Elasticsearch on localhost, as it's running on a separate machine

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[localhost:9200]]



Passing es.nodes via the cli with --conf es.nodes=elasticsearch doesn't help
either :/

pio train -- --conf es.nodes=elasticsearch --driver-memory 8g
--executor-memory 8g



Would anyone give any advice on what I am doing wrong?

I have separate docker containers for hadoop,hbase,elasticsearch,pio



Same setup was working fine on 0.10 and UR 0.5



Thanks,

Wojciech Kowalski


Re: UR 0.7.0 - problem with training

2018-03-08 Thread Pat Ferrel
es.nodes is supposed to be a string with hostnames separated by commas.
Depending on how your containers are set to communicate with the outside
world (Docker networking or port mapping) you may also need to set the
port, which is 9200 by default.

If your container is using port mapping and maps the container port 9200 to
the localhost port of 9200 you should be ok with only setting hostnames in
engine.json.

es.nodes=“localhost”
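
For example, mapping the port when starting the ES container would look
roughly like this (image name and tag are illustrative):

  docker run -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:5.5.2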

But I suspect you didn’t set your container communication strategy because
this is the fallback that would have been tried with no valid setting.

If this is true look up how you set Docker to communicate, port mapping is
the simplest for a single all-in-one machine.


From: Wojciech Kowalski  
Reply: user@predictionio.apache.org 

Date: March 8, 2018 at 7:31:10 AM
To: u...@predictionio.incubator.apache.org


Subject:  UR 0.7.0 - problem with training

Hello, I am trying to set up the new UR 0.7.0 with PredictionIO 0.12.0 but all
attempts are failing.



I cannot set the ES config in the engine's sparkConf section, as I get this
error:

org.elasticsearch.index.mapper.MapperParsingException: object mapping for
[sparkConf.es.nodes] tried to parse field [es.nodes] as object, but found a
concrete value



If I don't set this up, the engine fails to train because it cannot find
Elasticsearch on localhost, as it's running on a separate machine

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[localhost:9200]]



Passing es.nodes via the cli with --conf es.nodes=elasticsearch doesn't help
either :/

pio train -- --conf es.nodes=elasticsearch --driver-memory 8g
--executor-memory 8g



Would anyone give any advice on what I am doing wrong?

I have separate docker containers for hadoop,hbase,elasticsearch,pio



Same setup was working fine on 0.10 and UR 0.5



Thanks,

Wojciech Kowalski


Re: Spark 2.x/scala 2.11.x release

2018-03-03 Thread Pat Ferrel
LGTM


From: Andrew Palumbo <ap@outlook.com> <ap@outlook.com>
Reply: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org>
Date: March 2, 2018 at 3:38:38 PM
To: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org>
Subject:  Re: Spark 2.x/scala 2.11.x release

I've started a gdoc for a plan.


https://docs.google.com/document/d/1A8aqUORPp83vWa6fSqhC2jxUKEbDWWQ2lzqZ1V0xHS0/edit?usp=sharing


Please add comments, criticision, and alternate plans on on the doc.


--andy


From: Andrew Palumbo <ap@outlook.com>
Sent: Friday, March 2, 2018 6:11:53 PM
To: dev@mahout.apache.org
Subject: Re: Spark 2.x/scala 2.11.x release

Pat, could you explain what you mean by the "Real Problem"? I know that we
have a lot of problems, but in terms of this release, what is the major
blocker?

From: Pat Ferrel <p...@occamsmachete.com>
Sent: Friday, March 2, 2018 5:32:58 PM
To: Trevor Grant; dev@mahout.apache.org
Subject: Re: Spark 2.x/scala 2.11.x release

Scopt is so not an issue. None whatsoever. The problem is that drivers have
unmet runtime needs that are different than libs. Scopt has absolutely
nothing to do with this. It was from a false theory that there was no 2.11
version but it actually has 2.11, 2.12, 2.09, and according to D a native
version too.

Get on to the real problems and drop this non-problem. Anything the drivers
need but is not on the classpath will stop them at runtime.

Better to say that we would be closer to release if we dropped drivers.


From: Trevor Grant <trevor.d.gr...@gmail.com> <trevor.d.gr...@gmail.com>
Reply: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org>

Date: March 2, 2018 at 2:26:13 PM
To: Mahout Dev List <dev@mahout.apache.org> <dev@mahout.apache.org>
Subject: Re: Spark 2.x/scala 2.11.x release

Supposedly. I hard coded all of the poms to Scala 2.11 (closed PR unmerged)
Pat was still having issues w sbt- but the only dependency that was on 2.10
according to maven was scopt. /shrug



On Mar 2, 2018 4:20 PM, "Andrew Palumbo" <ap@outlook.com> wrote:

> So We could release as is if we can get the scopt issue out? Thats our
> final blocker?
>
> 
> From: Trevor Grant <trevor.d.gr...@gmail.com>
> Sent: Friday, March 2, 2018 5:15:35 PM
> To: Mahout Dev List
> Subject: Re: Spark 2.x/scala 2.11.x release
>
> The only "mess" is in the cli spark drivers, namely scopt.
>
> Get rid of the drivers/fix the scopt issue- we have no mess.
>
>
>
> On Mar 2, 2018 4:09 PM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
>
> > BTW the mess master is in is why git flow was invented and why I asked
> that
> > the site be in a new repo so it could be on a separate release cycle.
We
> > perpetuate the mess because it’s always to hard to fix.
> >
> >
> > From: Andrew Palumbo <ap@outlook.com> <ap@outlook.com>
> > Reply: dev@mahout.apache.org <dev@mahout.apache.org> <
> > dev@mahout.apache.org>
> > Date: March 2, 2018 at 1:54:51 PM
> > To: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org
> >
> > Subject: Re: Spark 2.x/scala 2.11.x release
> >
> > re: reverting master, shit. I forgot that the website is not on
> `asf-site`
> > anymore. Well we could just re-jigger it, and check out `website` from
> > features/multi-artifact-build-MAHOUT-20xx after we revert the rest of
> > master.
> >
> >
> > You're right, Trevor- I 'm just going through the commits, and there
are
> > things like
> > https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374
> > bf7494c3f3
> >
> >
> >
> > [https://avatars3.githubusercontent.com/u/5852441?s=200=4]<
> > https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374
> > bf7494c3f3>
> >
> >
> > MAHOUT-1988 Make Native Solvers Scala 2.11 Complient closes apache/ma…
·
> > apache/mahout@c17bee3<
> > https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374
> > bf7494c3f3>
> >
> > github.com
> > …hout#326
> >
> >
> >
> > (make Native Solvers Scala 2.11 compliant) and others peppered in, Post
> > 0.13.0. It still may be possible and not that hard, to cherrypick
> > everything after 0.13.0 that we want. But I see what you're saying
about
> it
> > not being completely simple.
> >
> >
> > As for Git-Flow. I dont really care. I use it in some projects and in
> > others i use GitHub-flow. (basically what we've been doing with merging
> >

Re: Spark 2.x/scala 2.11.x release

2018-03-02 Thread Pat Ferrel
Scopt is so not an issue. None whatsoever. The problem is that drivers have
unmet runtime needs that are different than libs. Scopt has absolutely
nothing to do with this. It was from a false theory that there was no 2.11
version but it actually has 2.11, 2.12, 2.09, and according to D a native
version too.

Get on to the real problems and drop this non-problem. Anything the drivers
need but is not on the classpath will stop them at runtime.

Better to say that we would be closer to release if we dropped drivers.


From: Trevor Grant <trevor.d.gr...@gmail.com> <trevor.d.gr...@gmail.com>
Reply: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org>
Date: March 2, 2018 at 2:26:13 PM
To: Mahout Dev List <dev@mahout.apache.org> <dev@mahout.apache.org>
Subject:  Re: Spark 2.x/scala 2.11.x release

Supposedly. I hard coded all of the poms to Scala 2.11 (closed PR unmerged)
Pat was still having issues w sbt- but the only dependency that was on 2.10
according to maven was scopt. /shrug



On Mar 2, 2018 4:20 PM, "Andrew Palumbo" <ap@outlook.com> wrote:

> So We could release as is if we can get the scopt issue out? Thats our
> final blocker?
>
> 
> From: Trevor Grant <trevor.d.gr...@gmail.com>
> Sent: Friday, March 2, 2018 5:15:35 PM
> To: Mahout Dev List
> Subject: Re: Spark 2.x/scala 2.11.x release
>
> The only "mess" is in the cli spark drivers, namely scopt.
>
> Get rid of the drivers/fix the scopt issue- we have no mess.
>
>
>
> On Mar 2, 2018 4:09 PM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
>
> > BTW the mess master is in is why git flow was invented and why I asked
> that
> > the site be in a new repo so it could be on a separate release cycle.
We
> > perpetuate the mess because it’s always to hard to fix.
> >
> >
> > From: Andrew Palumbo <ap@outlook.com> <ap@outlook.com>
> > Reply: dev@mahout.apache.org <dev@mahout.apache.org> <
> > dev@mahout.apache.org>
> > Date: March 2, 2018 at 1:54:51 PM
> > To: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org
> >
> > Subject: Re: Spark 2.x/scala 2.11.x release
> >
> > re: reverting master, shit. I forgot that the website is not on
> `asf-site`
> > anymore. Well we could just re-jigger it, and check out `website` from
> > features/multi-artifact-build-MAHOUT-20xx after we revert the rest of
> > master.
> >
> >
> > You're right, Trevor- I 'm just going through the commits, and there
are
> > things like
> > https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374
> > bf7494c3f3
> >
> >
> >
> > [https://avatars3.githubusercontent.com/u/5852441?s=200=4]<
> > https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374
> > bf7494c3f3>
> >
> >
> > MAHOUT-1988 Make Native Solvers Scala 2.11 Complient closes apache/ma…
·
> > apache/mahout@c17bee3<
> > https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374
> > bf7494c3f3>
> >
> > github.com
> > …hout#326
> >
> >
> >
> > (make Native Solvers Scala 2.11 compliant) and others peppered in, Post
> > 0.13.0. It still may be possible and not that hard, to cherrypick
> > everything after 0.13.0 that we want. But I see what you're saying
about
> it
> > not being completely simple.
> >
> >
> > As for Git-Flow. I dont really care. I use it in some projects and in
> > others i use GitHub-flow. (basically what we've been doing with merging
> > everything to master).
> >
> >
> > Though this exact problem that we have right now is why git-flow is
nice.
> > Lets separate the question of how we go forward, with what commit/repo
> > style, and First figure out how to back out what we have now, without
> > loosing all of the work that you did on the multi artifact build.
> >
> >
> > What do you think about reverting to 0.13.0, and cherry picking commits
> > like Sparse Speedup:
> > https://github.com/apache/mahout/commit/800a9ed6d7e015aa82b9eb7624bb44
> > 1b71a8f397
> > or checking out entire folders like `website`?
> >
> > [https://avatars3.githubusercontent.com/u/326731?s=200=4]<
> > https://github.com/apache/mahout/commit/800a9ed6d7e015aa82b9eb7624bb44
> > 1b71a8f397>
> >
> >
> > MAHOUT-2019 SparkRow Matrix Speedup and fixing change to scala 2.11 m…
·
> > apache/mahout@800a9ed<
> > https://github.com/apache/mahout/commit/800a9ed6d7e015aa82b9eb7624bb44
> > 1b71a8f397>
> >
> > githu

Re: Spark 2.x/scala 2.11.x release

2018-03-02 Thread Pat Ferrel
BTW the mess master is in is why git flow was invented and why I asked that
the site be in a new repo so it could be on a separate release cycle. We
perpetuate the mess because it's always too hard to fix.


From: Andrew Palumbo <ap@outlook.com> <ap@outlook.com>
Reply: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org>
Date: March 2, 2018 at 1:54:51 PM
To: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org>
Subject:  Re: Spark 2.x/scala 2.11.x release

re: reverting master, shit. I forgot that the website is not on `asf-site`
anymore. Well we could just re-jigger it, and check out `website` from
features/multi-artifact-build-MAHOUT-20xx after we revert the rest of
master.


You're right, Trevor- I 'm just going through the commits, and there are
things like
https://github.com/apache/mahout/commit/c17bee3c2705495b638d81ae2ad374bf7494c3f3




(make Native Solvers Scala 2.11 compliant) and others peppered in, Post
0.13.0. It still may be possible and not that hard, to cherrypick
everything after 0.13.0 that we want. But I see what you're saying about it
not being completely simple.


As for Git-Flow. I dont really care. I use it in some projects and in
others i use GitHub-flow. (basically what we've been doing with merging
everything to master).


Though this exact problem that we have right now is why git-flow is nice.
Lets separate the question of how we go forward, with what commit/repo
style, and First figure out how to back out what we have now, without
loosing all of the work that you did on the multi artifact build.


What do you think about reverting to 0.13.0, and cherry picking commits
like Sparse Speedup:
https://github.com/apache/mahout/commit/800a9ed6d7e015aa82b9eb7624bb441b71a8f397
or checking out entire folders like `website`?





From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Friday, March 2, 2018 3:58:07 PM
To: Mahout Dev List
Subject: Re: Spark 2.x/scala 2.11.x release

If you revert master to the release tag you're going to destroy the
website.

The website pulls and rebuilds from mater whenever Jenkins detects a
change.

mahout-0.13.0 has no website. So it will pull nothing and there will be no
site.

tg


On Fri, Mar 2, 2018 at 1:24 PM, Andrew Palumbo <ap@outlook.com> wrote:

>
> Sounds Good. I'll put out a proposal for the release, and we can go Over
> it and vote if we want to on releasing or on the scope. I'm +1 on it.
>
>
> Broad strokes of what I'm thinking:
>
>
> - Checkout a new branch "features/multi-artifact-build-22xx" from master
> @ the `mahout-0.13.0` release tag.
>
>
> - Revert master back to release tag.
>
>
> - Checkout a new `develop` branch from master @the `mahout-0.13.0`
release
> tag.
>
>
> - Cherrypick any commits that we'd like to release (E.g.: SparseSpeedup)
> onto `develop` (along with a PR ad a ticket).
>
>
> - Merge `develop` to `master`, run through Smoke tests, tag master @
> `mahout-0.13.1`(automatically), and release.
>
>
> This will also get us to more of a git-flow workflow, as we've discussed
> moving towards.
>
>
> Thoughts @all?
>
>
> --andy
>
>
>
>
>
>
> 
> From: Pat Ferrel <pat.fer...@gmail.com>
> Sent: Wednesday, February 28, 2018 2:53:58 PM
> To: Andrew Palumbo; dev@mahout.apache.org
> Subject: Re: Spark 2.x/scala 2.11.x release
>
> big +1
>
> If you are planning to branch off the 0.13.0 tag let me know, I have a
> speedup that is in my scala 2.11 fork of 0.13.0 that needs to be released
>
>
> From: Andrew Palumbo <ap@outlook.com><mailto:ap@outlook.com>
> Reply: dev@mahout.apache.org<mailto:dev@mahout.apache.org> <
> dev@mahout.apache.org><mailto:dev@mahout.apache.org>
> Date: February 28, 2018 at 11:16:12 AM
> To: dev@mahout.apache.org<mailto:dev@mahout.apache.org> <
> dev@mahout.apache.org><mailto:dev@mahout.apache.org>
> Subject: Spark 2.x/scala 2.11.x release
>
> After some offline discussion regarding people's needs

Re: Spark 2.x/scala 2.11.x release

2018-03-02 Thread Pat Ferrel
-1 on git flow?

What are your reasons? I use it in 3-4 projects, one of which is Apache PIO.
I'd think this would be a good example of why it's nice. Right now our
master is screwed up; with git flow it would still have 0.13.0, which is
what we are talking about releasing with minor mods.


From: Trevor Grant <trevor.d.gr...@gmail.com> <trevor.d.gr...@gmail.com>
Reply: dev@mahout.apache.org <dev@mahout.apache.org> <dev@mahout.apache.org>
Date: March 2, 2018 at 11:36:52 AM
To: Mahout Dev List <dev@mahout.apache.org> <dev@mahout.apache.org>
Subject:  Re: Spark 2.x/scala 2.11.x release

I'm +1 for a working scala-2.11, spark-2.x build.

I'm -1 for previously stated reasons on git-flow.

tg


On Fri, Mar 2, 2018 at 1:24 PM, Andrew Palumbo <ap@outlook.com> wrote:

>
> Sounds Good. I'll put out a proposal for the release, and we can go Over
> it and vote if we want to on releasing or on the scope. I'm +1 on it.
>
>
> Broad strokes of what I'm thinking:
>
>
> - Checkout a new branch "features/multi-artifact-build-22xx" from master
> @ the `mahout-0.13.0` release tag.
>
>
> - Revert master back to release tag.
>
>
> - Checkout a new `develop` branch from master @the `mahout-0.13.0`
release
> tag.
>
>
> - Cherrypick any commits that we'd like to release (E.g.: SparseSpeedup)
> onto `develop` (along with a PR ad a ticket).
>
>
> - Merge `develop` to `master`, run through Smoke tests, tag master @
> `mahout-0.13.1`(automatically), and release.
>
>
> This will also get us to more of a git-flow workflow, as we've discussed
> moving towards.
>
>
> Thoughts @all?
>
>
> --andy
>
>
>
>
>
>
> 
> From: Pat Ferrel <pat.fer...@gmail.com>
> Sent: Wednesday, February 28, 2018 2:53:58 PM
> To: Andrew Palumbo; dev@mahout.apache.org
> Subject: Re: Spark 2.x/scala 2.11.x release
>
> big +1
>
> If you are planning to branch off the 0.13.0 tag let me know, I have a
> speedup that is in my scala 2.11 fork of 0.13.0 that needs to be released
>
>
> From: Andrew Palumbo <ap@outlook.com><mailto:ap@outlook.com>
> Reply: dev@mahout.apache.org<mailto:dev@mahout.apache.org> <
> dev@mahout.apache.org><mailto:dev@mahout.apache.org>
> Date: February 28, 2018 at 11:16:12 AM
> To: dev@mahout.apache.org<mailto:dev@mahout.apache.org> <
> dev@mahout.apache.org><mailto:dev@mahout.apache.org>
> Subject: Spark 2.x/scala 2.11.x release
>
> After some offline discussion regarding people's needs for Spark and 2.x
> and Scala 2.11.x, I am wondering If we should just consider a release for
> 2.x and 2.11.x as the default. We could release from the current master,
or
> branch back off of the 0.13.0 tag, and release that with the upgraded
> defaults, and branch our current multi-artifact build off as a feature.
Any
> thoughts on this?
>
>
> --andy
>

