Re: Remove engine registration

2016-09-16 Thread Kenneth Chan
Pat, would you explain more about the 'instanceId' as in
`pio register --variant path/to/some-engine.json --instanceId
some-REST-compatible-resource-id`?

Currently PIO also has a concept of engineInstanceId, which is the output of
`pio train`. I think you are referring to a different thing, right?

Kenneth


On Fri, Sep 16, 2016 at 12:58 PM, Pat Ferrel  wrote:

> This is a great discussion topic and a great idea.
>
> However, the cons must also be addressed. We will need to do this before
> multi-tenant deploys can happen, and the benefits are just as large as
> removing `pio build`.
>
> It would be great to get rid of manifest.json and put all metadata in the
> store with an externally visible id, so that all parts of the workflow on all
> machines will get the right metadata and any template-specific commands
> will run from anywhere, on any cluster machine, and in any order. All we need
> is a global engine-instance id. This will make engine instances behave more
> like datasets, which are given permanent ids with `pio app new …`. This
> might be a new form of `pio register`, and it implies a new optional param
> to pio template-specific commands (the instance id), but it removes a lot of
> misunderstandings people have and easy mistakes in the workflow.
>
> So the workflow would be:
> 1) build with SBT/mvn
> 2) register any time engine.json changes, so make the json file an optional
> param to `pio register --variant path/to/some-engine.json --instanceId
> some-REST-compatible-resource-id`; the instance id could also be
> auto-generated and output, or optionally specified in the engine.json. `pio engine
> list` lists registered instances with their instanceId. The path to the binary
> would be stored with the instanceId and would be expected to be the same on all
> cluster machines that need it.
> 3) `pio train --instanceId`, optional if it's in engine.json
> 4) `pio deploy --instanceId`, optional if it's in engine.json
> 5) with easily recognized exceptions, all of the above can happen in any order
> on any cluster machine and from any directory.
>
> This takes one big step to multi-tenancy since the instance data has an
> externally visible id—call it a REST resource id…
>
> I bring this up not to confuse the issue, but because if we change the
> workflow commands we should avoid doing it often, given the disruption
> it brings.
>
>
> On Sep 16, 2016, at 10:42 AM, Donald Szeto  wrote:
>
> Hi all,
>
> I want to start the discussion of removing engine registration. How many
> people actually take advantage of being able to run pio commands everywhere
> outside of an engine template directory? This will be a nontrivial change
> on the operational side so I want to gauge the potential impact to existing
> users.
>
> Pros:
> - Stateless build. This would work well with many PaaS.
> - Eliminate the "pio build" command once and for all.
> - Ability to use your own build system, e.g. Maven, Ant, or Gradle.
> - Potentially better experience with IDEs, since engine templates no longer
> depend on an SBT plugin.
>
> Cons:
> - Inability to run pio engine training and deployment commands outside of
> the engine template directory.
> - No automatic version matching between the PIO binary distribution and the
> artifact versions used in the engine template.
> - A less unified user experience: from pio-build-train-deploy to build,
> then pio-train-deploy.
>
> Regards,
> Donald
>
>


Re: Remove engine registration

2016-09-16 Thread Pat Ferrel
This is a great discussion topic and a great idea.

However, the cons must also be addressed. We will need to do this before
multi-tenant deploys can happen, and the benefits are just as large as removing
`pio build`.

It would be great to get rid of manifest.json and put all metadata in the store
with an externally visible id, so that all parts of the workflow on all machines
will get the right metadata and any template-specific commands will run from
anywhere, on any cluster machine, and in any order. All we need is a global
engine-instance id. This will make engine instances behave more like datasets,
which are given permanent ids with `pio app new …`. This might be a new form of
`pio register`, and it implies a new optional param to pio template-specific
commands (the instance id), but it removes a lot of misunderstandings people have
and easy mistakes in the workflow.

So the workflow would be:
1) build with SBT/mvn
2) register any time engine.json changes, so make the json file an optional
param to `pio register --variant path/to/some-engine.json --instanceId
some-REST-compatible-resource-id`; the instance id could also be auto-generated
and output, or optionally specified in the engine.json. `pio engine list` lists
registered instances with their instanceId. The path to the binary would be
stored with the instanceId and would be expected to be the same on all cluster
machines that need it.
3) `pio train --instanceId`, optional if it's in engine.json
4) `pio deploy --instanceId`, optional if it's in engine.json
5) with easily recognized exceptions, all of the above can happen in any order on
any cluster machine and from any directory.
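
A minimal sketch of that flow from a shell, assuming the proposed `--instanceId` option and `pio engine list` command land as described above (the id value and the sbt step are only illustrative):

```
# 1) build with your own tool, e.g. SBT, from the engine template directory
sbt package

# 2) register the variant under an externally visible, REST-compatible id
pio register --variant path/to/some-engine.json --instanceId my-engine-instance-1

# list registered engine instances and their ids
pio engine list

# 3) and 4) train and deploy from any machine that shares the metadata store
pio train --instanceId my-engine-instance-1
pio deploy --instanceId my-engine-instance-1
```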

This takes one big step to multi-tenancy since the instance data has an 
externally visible id—call it a REST resource id…

I bring this up not to confuse the issue, but because if we change the workflow
commands we should avoid doing it often, given the disruption it brings.


On Sep 16, 2016, at 10:42 AM, Donald Szeto  wrote:

Hi all,

I want to start the discussion of removing engine registration. How many people 
actually take advantage of being able to run pio commands everywhere outside of 
an engine template directory? This will be a nontrivial change on the 
operational side so I want to gauge the potential impact to existing users.

Pros:
- Stateless build. This would work well with many PaaS.
- Eliminate the "pio build" command once and for all.
- Ability to use your own build system, e.g. Maven, Ant, or Gradle.
- Potentially better experience with IDEs, since engine templates no longer
depend on an SBT plugin.

Cons:
- Inability to run pio engine training and deployment commands outside of
the engine template directory.
- No automatic version matching between the PIO binary distribution and the
artifact versions used in the engine template.
- A less unified user experience: from pio-build-train-deploy to build, then
pio-train-deploy.

Regards,
Donald



Re: Batch import, Java

2016-09-16 Thread Pat Ferrel
Which brings up the next set of issues: what do we do for Salesforce-owned
SDKs? Can the SDKs be donated?

In any case, I suggest we add to the “gallery”, better categorized to include
SDKs, templates, and other extras like containers. If we are agreed about
external templates, then we could treat SDKs and the rest in the same manner.

1) PR to the Gallery page to request inclusion
2) Committer reviews and pushes the change if the inclusion seems worthwhile 
and includes a link to some support. License type is irrelevant since we are 
not publishing any of this, only descriptions and links.
3) Donations of the work can be processed if there is a provision for continued 
support by a current committer or by the author and all other due process is 
followed.

That way, if the Salesforce-owned SDKs are donated, great. If not, we can still
put them, or some fork of them, in the Gallery.

Sound good?


On Sep 11, 2016, at 1:09 AM, Kenneth Chan  wrote:

I think the Java SDK already supports this (creating events to a file), but it's
not documented.
https://github.com/PredictionIO/PredictionIO-Java-SDK/commit/6691144ebf1382aa1d060770a4fb7c0268f849d3


On Fri, Sep 9, 2016 at 7:59 AM, Pat Ferrel wrote:
The page is now live


On Sep 8, 2016, at 11:49 AM, Gustavo Frederico wrote:

The page at
http://predictionio.incubator.apache.org/datacollection/batchimport/
displays "(coming soon)" for the Java SDK. Any ideas about when that will
happen?

Thanks

Gustavo


Remove engine registration

2016-09-16 Thread Donald Szeto
Hi all,

I want to start the discussion of removing engine registration. How many
people actually take advantage of being able to run pio commands everywhere
outside of an engine template directory? This will be a nontrivial change
on the operational side so I want to gauge the potential impact to existing
users.

Pros:
- Stateless build. This would work well with many PaaS.
- Eliminate the "pio build" command once and for all.
- Ability to use your own build system, e.g. Maven, Ant, or Gradle (see the
sketch after this list).
- Potentially better experience with IDEs, since engine templates no longer
depend on an SBT plugin.
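
A rough sketch of what the "use your own build system" point could look like in practice, assuming pio would pick up a locally built engine assembly once `pio build` is gone (the exact mechanism is still an open question, and the directory name is hypothetical):

```
cd MyEngineTemplate          # hypothetical engine template directory

# build the engine with a build tool of your choice instead of `pio build`,
# e.g. sbt with the sbt-assembly plugin, or mvn package / gradle build
sbt clean assembly

# then use pio only for the runtime steps, from the engine directory
pio train
pio deploy --port 8000
```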

Cons:
- Inability to run pio engine training and deployment commands outside of
the engine template directory.
- No automatic version matching between the PIO binary distribution and the
artifact versions used in the engine template.
- A less unified user experience: from pio-build-train-deploy to build,
then pio-train-deploy.

Regards,
Donald


Re: [VOTE]: Apache PredictionIO (incubating) 0.10.0 Release

2016-09-16 Thread Donald Szeto
If everyone agrees that the artifacts should have an "apache-" prefix, I
will roll an RC2 shortly. Since the namespace change is unavoidable, might
as well get all artifact name changes done in one shot for good.

On Friday, September 16, 2016, Andrew Purtell wrote:

> Let me double check.
>
> > On Sep 16, 2016, at 7:33 AM, Alex Merritt wrote:
> >
> > I believe it depends on which of the two votes you mean. For the podling
> > vote, PPMC votes are binding; for the incubator vote, IPMC votes are,
> > no?
> >
> >> On Sep 15, 2016 9:42 PM, "Andrew Purtell" wrote:
> >>
> >> I believe 'binding' only applies to IPMC.
> >>
> >>> On Sep 15, 2016, at 12:49 PM, Suneel Marthi wrote:
> >>>
> >>> Folks, when you vote please specify "+1 Binding" if you are a PMC member. It's
> >>> only the PMC votes that count for a release to pass.
> >>>
> >>>
> >>>
> >> On Thu, Sep 15, 2016 at 2:11 PM, Robert Lu wrote:
> 
>  +1
> 
> > On Sep 15, 2016, at 01:13, Matthew Tovbin wrote:
> >
> > +1
> >
> >> On Wed, Sep 14, 2016 at 10:12 AM, Pat Ferrel wrote:
> >
> >> +1
> >>
> >>
> >> On Sep 13, 2016, at 11:55 AM, Donald Szeto wrote:
> >>
> >> This is the vote for 0.10.0 of Apache PredictionIO (incubating).
> >>
> >> The vote will run for at least 72 hours and will close on Sept 16th,
>  2016.
> >>
> >> The artifacts can be downloaded here:
> >> https://dist.apache.org/repos/dist/dev/incubator/predictionio/0.10.0-incubating-rc1/
> >> or
> >> from the Maven staging repo here:
> >> https://repository.apache.org/content/repositories/orgapachepredictionio-1001/
> >>
> >> All JIRAs completed for this release are tagged with 'FixVersion =
>  0.10.0'.
> >> You can view them here:
> >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320420&version=12337844
> >>
> >> The artifacts have been signed with Key : 8BF4ABEB
> >>
> >> Please vote accordingly:
> >>
> >> [ ] +1, accept RC as the official 0.10.0 release
> >> [ ] -1, do not accept RC as the official 0.10.0 release because...
> >>
>


[GitHub] incubator-predictionio pull request #295: [PIO-30] Set up a cross build for ...

2016-09-16 Thread Ziemin
GitHub user Ziemin opened a pull request:

https://github.com/apache/incubator-predictionio/pull/295

[PIO-30] Set up a cross build for Scala 2.10 (Spark 1.6.2) and Scala …

This PR introduces a simple profile-based build of PredictionIO, which 
comes along with some upgrades including a new version of Spark. 

### The key changes include:

* build.sbt - here I created two profiles with different sets of artifacts
to be included. Their names are _scala-2.11_ and _scala-2.10_, where the
former is chosen by default. In order to set a desired profile, the
`-Dbuild.profile=` property has to be provided to the sbt command.
You can print the description of a profile, e.g.
```sbt -Dbuild.profile=scala-2.10 printProfile```
The _scala-2.11_ profile sets the Spark version to 2.0.0, while _scala-2.10_
sets it to 1.6.2. This can be configured by adding a dedicated property:
```sbt -Dbuild.profile=scala-2.11 -Dspark.version=1.6.0 -Dhadoop.version=2.6.4```
This command will set the build profile to _scala-2.11_, but will use
different versions of Spark and Hadoop. It makes the configuration more
flexible, especially if someone wants to build the project according to their
own needs (see the end-to-end sketch after this list).
One very important thing to note is that Spark versions before 1.6.x are no
longer supported and Scala 2.10.x is deprecated.

* `data/src/main/spark-1/org/apache/predictionio/data/SparkVersionDependent.scala`
and `data/src/main/spark-2/org/apache/predictionio/data/SparkVersionDependent.scala`
are the only examples of version-dependent code. They exist solely to provide
the proper type of object for Spark SQL-related actions. Sbt is configured to
include version-specific source paths like these.

* make_distribution.sh - in order to create an archive for Scala 2.10, one
has to provide it with an argument in the same way as sbt
(`./make_distribution.sh -Dbuild.profile=scala-2.10`). By default it will
build for _scala-2.11_.

* integration tests - the Docker image is updated; I pushed it with the tag
spark_2.0.0 so as not to interfere with the current build. It contains both
versions of Spark and on startup sets up the environment according to the
dependencies that PredictionIO was built with. It uses a simple Java program,
`tests/docker-files/BuildInfoPrinter.java`, linked with the assembly of
PredictionIO to acquire the necessary information. Travis CI makes use of this
setup and runs 8 parallel builds, the number having doubled because of the two
build profiles.
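
For reference, a rough end-to-end sketch of driving the two profiles from a shell, using only the commands described in this PR (the `test` task in the last line is just a standard sbt example, not something this PR adds):

```
# default profile: Scala 2.11 / Spark 2.0.0
sbt printProfile                              # show the active profile
./make_distribution.sh                        # package a scala-2.11 distribution

# alternate profile: Scala 2.10 / Spark 1.6.2
sbt -Dbuild.profile=scala-2.10 printProfile
./make_distribution.sh -Dbuild.profile=scala-2.10

# override dependency versions on top of a profile
sbt -Dbuild.profile=scala-2.11 -Dspark.version=1.6.0 -Dhadoop.version=2.6.4 test

# rebuild the integration-test Docker image with the dedicated script
tests/docker-build.sh
```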

I also noticed that there are many places hardcoding different version
numbers and links to packages to be downloaded. Maintaining the project and
keeping everything consistent only gets more difficult, therefore I came up
with a few small scripts that read the proper dependency versions from the
build config and set some variables accordingly. An example is `conf/vendors.sh`,
which, provided that some variables are set (e.g. by `dev/set-build-profile.sh`),
initializes some other useful variables, i.e. `SPARK_DOWNLOAD`, `SPARK_ARCHIVE`,
`SPARK_DIRNAME`. They are used in the Travis configuration as well as in the
Dockerfile, which should now be built with a dedicated script,
`tests/docker-build.sh`. This approach makes it easier to keep everything
coherent while bumping version numbers.
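
A hedged sketch of the kind of helper described above (the real `conf/vendors.sh` may look quite different; the exported variable names come from this paragraph, while the `PIO_SPARK_VERSION` variable and the download URL pattern are assumptions for illustration only):

```
#!/usr/bin/env bash
# Sketch only: derive Spark download metadata from the selected build profile.
# Assumes something like dev/set-build-profile.sh has exported PIO_SPARK_VERSION.
SPARK_DIRNAME="spark-${PIO_SPARK_VERSION}-bin-hadoop2.6"   # assumed archive naming
SPARK_ARCHIVE="${SPARK_DIRNAME}.tgz"
SPARK_DOWNLOAD="https://archive.apache.org/dist/spark/spark-${PIO_SPARK_VERSION}/${SPARK_ARCHIVE}"
export SPARK_DIRNAME SPARK_ARCHIVE SPARK_DOWNLOAD
```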

### Some problems encountered during the upgrade
Updating Spark caused some trouble:

* The classpath has to be extended to run the unit tests successfully for
some sub-packages (see `build.sbt`).
* Column names have to be handled differently for Postgres in JDBCPEvents,
as Spark surrounds them with double quotes, which breaks the current schema
in this case.
* `tests/pio_tests/utils.py` has a special Spark pass-through argument to
set `spark.sql.warehouse.dir`, because the defaults cause runtime exceptions.
See
[here](https://mail-archives.apache.org/mod_mbox/spark-user/201608.mbox/%3ccamassd+efz+uscmnzvkfp00qbr9ynv8lrfhvz9lrmnwh2vk...@mail.gmail.com%3E)
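
For illustration, the same setting can also be forwarded to spark-submit on the command line, since pio passes arguments after `--` through to spark-submit (the warehouse path below is only an example):

```
pio train -- --conf spark.sql.warehouse.dir=/tmp/pio-warehouse
```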


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Ziemin/incubator-predictionio upgrade

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-predictionio/pull/295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #295


commit 39879f04639b123510f0668e0de7b33fd7418784
Author: Marcin Ziemiński 
Date:   2016-08-09T21:20:54Z

[PIO-30] Set up a cross build for Scala 2.10 (Spark 1.6.2) and Scala 2.11
(Spark 2.0.0).

Changes also include updating travis integration tests, which now run
eight parallel builds.





Re: [VOTE]: Apache PredictionIO (incubating) 0.10.0 Release

2016-09-16 Thread Andrew Purtell
Let me double check. 

> On Sep 16, 2016, at 7:33 AM, Alex Merritt  wrote:
> 
> I believe it depends on which of the two votes you mean. For the podling
> vote, PPMC votes are binding; for the incubator vote, IPMC votes are, no?
> 
>> On Sep 15, 2016 9:42 PM, "Andrew Purtell"  wrote:
>> 
>> I believe 'binding' only applies to IPMC.
>> 
>>> On Sep 15, 2016, at 12:49 PM, Suneel Marthi wrote:
>>> 
>>> Folks, when you vote please specify "+1 Binding" if you are a PMC member. It's
>>> only the PMC votes that count for a release to pass.
>>> 
>>> 
>>> 
>> On Thu, Sep 15, 2016 at 2:11 PM, Robert Lu wrote:
 
 +1
 
> On Sep 15, 2016, at 01:13, Matthew Tovbin  wrote:
> 
> +1
> 
>> On Wed, Sep 14, 2016 at 10:12 AM, Pat Ferrel wrote:
> 
>> +1
>> 
>> 
>> On Sep 13, 2016, at 11:55 AM, Donald Szeto wrote:
>> 
>> This is the vote for 0.10.0 of Apache PredictionIO (incubating).
>> 
>> The vote will run for at least 72 hours and will close on Sept 16th,
 2016.
>> 
>> The artifacts can be downloaded here:
>> https://dist.apache.org/repos/dist/dev/incubator/predictionio/0.10.0-incubating-rc1/
>> or
>> from the Maven staging repo here:
>> https://repository.apache.org/content/repositories/orgapachepredictionio-1001/
>> 
>> All JIRAs completed for this release are tagged with 'FixVersion =
 0.10.0'.
>> You can view them here:
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320420&version=12337844
>> 
>> The artifacts have been signed with Key : 8BF4ABEB
>> 
>> Please vote accordingly:
>> 
>> [ ] +1, accept RC as the official 0.10.0 release
>> [ ] -1, do not accept RC as the official 0.10.0 release because...
>> 


[jira] [Commented] (PIO-30) Cross build for different versions of scala and spark

2016-09-16 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/PIO-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495931#comment-15495931 ]

ASF GitHub Bot commented on PIO-30:
---

GitHub user Ziemin opened a pull request:

https://github.com/apache/incubator-predictionio/pull/295

[PIO-30] Set up a cross build for Scala 2.10 (Spark 1.6.2) and Scala …

This PR introduces a simple profile-based build of PredictionIO, which 
comes along with some upgrades including a new version of Spark. 

### The key changes include:

* build.sbt - here I created two profiles with different sets of artifacts
to be included. Their names are _scala-2.11_ and _scala-2.10_, where the
former is chosen by default. In order to set a desired profile, the
`-Dbuild.profile=` property has to be provided to the sbt command.
You can print the description of a profile, e.g.
```sbt -Dbuild.profile=scala-2.10 printProfile```
The _scala-2.11_ profile sets the Spark version to 2.0.0, while _scala-2.10_
sets it to 1.6.2. This can be configured by adding a dedicated property:
```sbt -Dbuild.profile=scala-2.11 -Dspark.version=1.6.0 -Dhadoop.version=2.6.4```
This command will set the build profile to _scala-2.11_, but will use
different versions of Spark and Hadoop. It makes the configuration more
flexible, especially if someone wants to build the project according to their
own needs.
One very important thing to note is that Spark versions before 1.6.x are no
longer supported and Scala 2.10.x is deprecated.

* `data/src/main/spark-1/org/apache/predictionio/data/SparkVersionDependent.scala`
and `data/src/main/spark-2/org/apache/predictionio/data/SparkVersionDependent.scala`
are the only examples of version-dependent code. They exist solely to provide
the proper type of object for Spark SQL-related actions. Sbt is configured to
include version-specific source paths like these.

* make_distribution.sh - in order to create an archive for Scala 2.10, one
has to provide it with an argument in the same way as sbt
(`./make_distribution.sh -Dbuild.profile=scala-2.10`). By default it will
build for _scala-2.11_.

* integration tests - the Docker image is updated; I pushed it with the tag
spark_2.0.0 so as not to interfere with the current build. It contains both
versions of Spark and on startup sets up the environment according to the
dependencies that PredictionIO was built with. It uses a simple Java program,
`tests/docker-files/BuildInfoPrinter.java`, linked with the assembly of
PredictionIO to acquire the necessary information. Travis CI makes use of this
setup and runs 8 parallel builds, the number having doubled because of the two
build profiles.

I also noticed that there are many places hardcoding different version
numbers and links to packages to be downloaded. Maintaining the project and
keeping everything consistent only gets more difficult, therefore I came up
with a few small scripts that read the proper dependency versions from the
build config and set some variables accordingly. An example is `conf/vendors.sh`,
which, provided that some variables are set (e.g. by `dev/set-build-profile.sh`),
initializes some other useful variables, i.e. `SPARK_DOWNLOAD`, `SPARK_ARCHIVE`,
`SPARK_DIRNAME`. They are used in the Travis configuration as well as in the
Dockerfile, which should now be built with a dedicated script,
`tests/docker-build.sh`. This approach makes it easier to keep everything
coherent while bumping version numbers.

### Some problems encountered during the upgrade
Updating Spark caused some trouble:

* The classpath has to be extended to run the unit tests successfully for
some sub-packages (see `build.sbt`).
* Column names have to be handled differently for Postgres in JDBCPEvents,
as Spark surrounds them with double quotes, which breaks the current schema
in this case.
* `tests/pio_tests/utils.py` has a special Spark pass-through argument to
set `spark.sql.warehouse.dir`, because the defaults cause runtime exceptions.
See
[here](https://mail-archives.apache.org/mod_mbox/spark-user/201608.mbox/%3ccamassd+efz+uscmnzvkfp00qbr9ynv8lrfhvz9lrmnwh2vk...@mail.gmail.com%3E)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Ziemin/incubator-predictionio upgrade

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-predictionio/pull/295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #295


commit 39879f04639b123510f0668e0de7b33fd7418784
Author: Marcin Ziemiński 
Date:   2016-08-09T21:20:54Z

[PIO-30] Set up a cross build for Scala 2.10 (Spark 1.6.2) and Scala 2.11
(Spark 2.0.0).

Changes also include updating travis integration tests, which now run
eight parallel builds.