Re: How to train and deploy on different machines?

2017-09-20 Thread Brian Chiu
Dear Pat,

Thanks for the detailed guide.  It is nice to know it is possible.
But I am not sure I understand it correctly, so could you please
point out any misunderstandings in the following, if there are any?


Let's say I have 3 machines.

There is one machine [EventServer and data store] for ES, HBase+HDFS (or
Postgres, but that is not recommended).
The other 2 machines will both connect to this machine.
It is permanent.

Machine [TrainingServer] will run `pio build` and `pio train`.
This step pulls training data from [EventServer] and then stores the model
and metadata back.
It is not permanent.

Machine [PredictionServer] gets a copy of the template from machine
[TrainingServer] (this only needs to be done once),
then runs `pio deploy`.
It is not a Spark driver or executor for training.
Write a cron job for `pio deploy`.
It is permanent.
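
If I got that right, the commands on each machine would look roughly
like this (just a sketch to check my understanding; `~/universal-recommender`
and `training-host` are placeholders, and the engine path must be the same
on every machine):

    # [EventServer] machine: permanent stores plus event intake
    pio eventserver &

    # [TrainingServer] machine: spin up, rebuild and retrain, shut down
    cd ~/universal-recommender
    pio build
    pio train

    # [PredictionServer] machine: copy the engine directory once, then deploy
    scp -r training-host:universal-recommender ~/universal-recommender
    cd ~/universal-recommender
    pio deploy &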


Thanks

Brian

On Wed, Sep 20, 2017 at 11:16 PM, Pat Ferrel  wrote:
> Yes, this is the recommended config (Postgres is not, but more on that later). Spark is
> only needed during training, but the `pio train` process creates drivers and
> executors in Spark. The driver will be the `pio train` machine so you must
> install pio on it. You should have at least 2 Spark machines because the
> driver and executor need roughly the same memory; more executors will train
> faster.
>
> You will have to spread the pio “workflow” out over a permanent
> deploy+eventserver machine. I usually call this a combo PredictionServer and
> EventServer. These are 2 JVM processes that take events and respond to queries
> and so must be available all the time. You will run `pio eventserver` and
> `pio deploy` on this machine. The Spark driver machine will run `pio train`.
> Since no state is stored in PIO this will work because the machines get
> state from the DBs (HBase is recommended, along with Elasticsearch). Install pio
> and the UR in the same location on all machines because the path to the UR
> is used by PIO to give an id to the engine (not ideal, but oh well).
>
> Once set up:
>
> Run `pio eventserver` on the permanent PS/ES machine and input your data
> into the EventServer.
> Run `pio build` on the “driver” machine and `pio train` on the same machine.
> This builds the UR, puts metadata about the instance in PIO, and creates the
> Spark driver, which can use a separate machine or 3 as Spark executors.
> Then copy the UR directory to the PS/ES machine and do `pio deploy` from the
> copied directory.
> Shut down the driver machine and Spark executors. For AWS, “stopping” them
> means config is saved so you only pay for EBS storage. You will start them
> before the next train.
>
>
> From then on there is no need to copy the UR directory; just spin up the
> driver and any other Spark machines, do `pio train`, and you are done. The
> model is automatically hot-swapped with the old one, with no downtime and no
> need to re-deploy.
>
> This will only work in this order if you want to take advantage of a
> temporary Spark cluster. PIO is installed on the PS/ES machine and the “driver”
> machine in exactly the same way, connecting to the same stores.
>
> Hmm, I should write a How to for this...
>
>
>
> On Sep 20, 2017, at 3:23 AM, Brian Chiu  wrote:
>
> Hi,
>
> I would like to be able to train and run the model on different machines.
> The reason is, on my dataset, training takes around 16GB of memory and
> deploying only needs 8GB.  In order to save money, it would be better
> if only an 8GB memory machine is used in production, and only to start a
> 16GB one perhaps weekly for training.  Is this possible with
> PredictionIO + Universal Recommender?
>
> I have done some searching and found a related guide here:
> https://github.com/actionml/docs.actionml.com/blob/master/pio_load_balancing.md
> which copies the whole template directory and then runs pio deploy.  But
> in their case HBase and Elasticsearch clusters are used.  In my case
> only a single machine is used, with Elasticsearch and PostgreSQL.  Will
> this work?  (I am flexible about using PostgreSQL or localfs or HBase,
> but I cannot afford a cluster.)
>
> Perhaps another solution is to make the 16GB machine a Spark slave,
> start it before training starts, and have the 8GB machine connect to
> it, then call pio train; pio deploy on the 8GB machine, and finally
> shut down the 16GB machine.  But I have no idea if that can work.  And if
> yes, is there any documentation I can look into?
>
> Any other method is welcome!  Zero downtime is preferred but not necessary.
>
> Thanks in advance.
>
>
> Best Regards,
> Brian
>


Re: Unable to connect to all storage backends successfully

2017-09-20 Thread Jim Miller
Hi Donald,

I did not.  I will read the release notes and update accordingly.

Thanks!

Jim

-- 
Jim Miller

On September 20, 2017 at 1:01:53 PM, Donald Szeto (don...@apache.org) wrote:

Hey Jim,

Did you build PIO 0.12 with ES 1.4 support? ES 1.x is being deprecated in 0.12 
so the default build will use ES 5.x support.

See the upcoming release notes: 
https://github.com/apache/incubator-predictionio/blob/release/0.12.0/RELEASE.md#behavior-changes

Regards,
Donald

On Wed, Sep 20, 2017 at 8:14 AM, Jim Miller  wrote:
Yes.  I’m following this tutorial but using 0.12.0 instead of 0.10.0:
https://medium.freecodecamp.org/building-an-recommendation-engine-with-apache-prediction-io-ml-server-aed0319e0d8

-- 
Jim Miller

On September 20, 2017 at 10:51:39 AM, Pat Ferrel (p...@occamsmachete.com) wrote:

Meaning, is “firstcluster” the cluster name in your Elasticsearch configuration?


On Sep 19, 2017, at 8:54 PM, Vaghawan Ojha  wrote:

I think the problem is with Elasticsearch. Are you sure the cluster exists in 
the Elasticsearch configuration? 

On Wed, Sep 20, 2017 at 8:17 AM, Jim Miller  wrote:
Hi,

I’m using PredictionIO 0.12.0-incubating with ElasticSearch and Hbase:
PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4
PredictionIO-0.12.0-incubating/vendors/hbase-1.0.0
PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6

Everything starts with no errors, but with pio status I get:

[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6
[INFO] [Management$] Apache Spark 1.5.1 detected (meets minimum requirement of 
1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.

Connection closed 
(org.apache.predictionio.shaded.org.apache.http.ConnectionClosedException)

Dumping configuration of initialized storage backend sources.
Please make sure they are correct.

Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME -> 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4, 
HOSTS -> localhost, PORTS -> 9300, CLUSTERNAME -> firstcluster, TYPE -> 
elasticsearch

Can anyone give me an idea of what I need to fix this issue?  Here is my pio-env.sh:

# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
# SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
SPARK_HOME=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#              your Elasticsearch setup.
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-1.4.4/conf

# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO
#                  with Hadoop 2.
HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6/conf

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
#                 with HBase on a remote cluster.
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.0.0/conf

# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/

# Storage Repositories

# Default is to use PostgreSQL
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

# Storage Data Sources

# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio

Re: Unable to connect to all storage backends successfully

2017-09-20 Thread Donald Szeto
Hey Jim,

Did you build PIO 0.12 with ES 1.4 support? ES 1.x is being deprecated in
0.12 so the default build will use ES 5.x support.

See the upcoming release notes:
https://github.com/apache/incubator-predictionio/blob/release/0.12.0/RELEASE.md#behavior-changes
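
From memory, the Elasticsearch major version is selected at build time with a
property passed to make-distribution.sh; something like the line below, but
please verify the exact property name and supported versions against the
release notes rather than trusting my memory:

    # sketch from memory; check RELEASE.md for the real flags/versions
    ./make-distribution.sh -Delasticsearch.version=1.7.6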

Regards,
Donald

On Wed, Sep 20, 2017 at 8:14 AM, Jim Miller  wrote:

> Yes.  I’m following this tutorial but using 0.12.0 instead of 0.10.0:
> https://medium.freecodecamp.org/building-an-recommendation-engine-with-apache-prediction-io-ml-server-aed0319e0d8
>
> --
> Jim Miller
>
> On September 20, 2017 at 10:51:39 AM, Pat Ferrel (p...@occamsmachete.com)
> wrote:
>
> Meaning, is “firstcluster” the cluster name in your Elasticsearch
> configuration?
>
>
> On Sep 19, 2017, at 8:54 PM, Vaghawan Ojha  wrote:
>
> I think the problem is with Elasticsearch. Are you sure the cluster exists
> in the Elasticsearch configuration?
>
> On Wed, Sep 20, 2017 at 8:17 AM, Jim Miller 
> wrote:
>
>> Hi,
>>
>> I’m using PredictionIO 0.12.0-incubating with ElasticSearch and Hbase:
>>
>> PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4
>> PredictionIO-0.12.0-incubating/vendors/hbase-1.0.0
>> PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6
>>
>>
>> Everything starts with no errors, but with pio status I get:
>>
>> [INFO] [Management$] Inspecting PredictionIO...
>> [INFO] [Management$] PredictionIO 0.12.0-incubating is installed at
>> /home/vagrant/pio/PredictionIO-0.12.0-incubating
>> [INFO] [Management$] Inspecting Apache Spark...
>> [INFO] [Management$] Apache Spark is installed at
>> /home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6
>> [INFO] [Management$] Apache Spark 1.5.1 detected (meets minimum
>> requirement of 1.3.0)
>> [INFO] [Management$] Inspecting storage backend connections...
>> [INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
>> [ERROR] [Management$] Unable to connect to all storage backends
>> successfully.
>> The following shows the error message from the storage backend.
>>
>> Connection closed
>> (org.apache.predictionio.shaded.org.apache.http.ConnectionClosedException)
>>
>> Dumping configuration of initialized storage backend sources.
>> Please make sure they are correct.
>>
>> Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME ->
>> /home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4,
>> HOSTS -> localhost, PORTS -> 9300, CLUSTERNAME -> firstcluster, TYPE ->
>> elasticsearch
>>
>>
>> Can anyone give me an idea of what I need to fix this issue?  Here is my pio-env.sh:
>>
>>
>> # PredictionIO Main Configuration
>> #
>> # This section controls core behavior of PredictionIO. It is very likely
>> # that you need to change these to fit your site.
>>
>> # SPARK_HOME: Apache Spark is a hard dependency and must be configured.
>> # SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
>> SPARK_HOME=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6
>>
>> POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
>> MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar
>>
>> # ES_CONF_DIR: You must configure this if you have advanced configuration
>> # for your Elasticsearch setup.
>> ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-1.4.4/conf
>>
>> # HADOOP_CONF_DIR: You must configure this if you intend to run
>> # PredictionIO with Hadoop 2.
>> HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6/conf
>>
>> # HBASE_CONF_DIR: You must configure this if you intend to run
>> # PredictionIO with HBase on a remote cluster.
>> HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.0.0/conf
>>
>> # Filesystem paths where PredictionIO uses as block storage.
>> PIO_FS_BASEDIR=$HOME/.pio_store
>> PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
>> PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp
>>
>> # PredictionIO Storage Configuration
>> #
>> # This section controls programs that make use of PredictionIO's built-in
>> # storage facilities. Default values are shown below.
>> #
>> # For more information on storage configuration please refer to
>> # http://predictionio.incubator.apache.org/system/anotherdatastore/
>>
>> # Storage Repositories
>>
>> # Default is to use PostgreSQL
>> PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
>> PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
>>
>> PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
>> PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
>>
>> PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
>> PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
>>
>> # Storage Data Sources
>>
>> # PostgreSQL Default Settings
>> # Please change "pio" to your database name in
>> PIO_STORAGE_SOURCES_PGSQL_URL
>> # Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
>> # PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
>> # PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
>> # PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
>> # PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
>> # PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

Re: How to train and deploy on different machines?

2017-09-20 Thread Pat Ferrel
Yes, this is the recommended config (Postgres is not, but more on that later). Spark is only 
needed during training, but the `pio train` process creates drivers and executors 
in Spark. The driver will be the `pio train` machine so you must install pio on 
it. You should have at least 2 Spark machines because the driver and executor 
need roughly the same memory; more executors will train faster.

You will have to spread the pio “workflow” out over a permanent 
deploy+eventserver machine. I usually call this a combo PredictionServer and 
EventServer. These are 2 JVM processes that take events and respond to queries 
and so must be available all the time. You will run `pio eventserver` and `pio 
deploy` on this machine. The Spark driver machine will run `pio train`. Since 
no state is stored in PIO this will work because the machines get state from 
the DBs (HBase is recommended, along with Elasticsearch). Install pio and the UR in 
the same location on all machines because the path to the UR is used by PIO to 
give an id to the engine (not ideal, but oh well). 

Once set up:
Run `pio eventserver` on the permanent PS/ES machine and input your data into 
the EventServer.
Run `pio build` on the “driver” machine and `pio train` on the same machine. 
This builds the UR, puts metadata about the instance in PIO, and creates the 
Spark driver, which can use a separate machine or 3 as Spark executors (see the 
sketch after these steps).
Then copy the UR directory to the PS/ES machine and do `pio deploy` from the 
copied directory.
Shut down the driver machine and Spark executors. For AWS, “stopping” them means 
config is saved so you only pay for EBS storage. You will start them before the 
next train.
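
For the executor side, `pio train` passes anything after `--` straight to 
spark-submit, so pointing training at separate Spark machines looks roughly 
like this (the master URL and memory sizes below are made-up placeholders):

    # run from the UR directory on the "driver" machine
    pio train -- --master spark://spark-master:7077 \
      --driver-memory 16g --executor-memory 16g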

From then on there is no need to copy the UR directory; just spin up the driver 
and any other Spark machines, do `pio train`, and you are done. The model is 
automatically hot-swapped with the old one, with no downtime and no need to 
re-deploy.
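
That means the recurring part can be just a scheduled retrain on the driver 
machine, e.g. a cron entry like this sketch (the path is an example, pio is 
assumed to be on cron's PATH, and starting/stopping the Spark machines is 
assumed to happen around it):

    # retrain every Sunday at 02:00, run from the UR engine directory
    0 2 * * 0 cd /home/pio/universal-recommender && pio train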

This will only work in this order if you want to take advantage of a temporary 
Spark cluster. PIO is installed on the PS/ES machine and the “driver” machine in 
exactly the same way, connecting to the same stores.

Hmm, I should write a How to for this...



On Sep 20, 2017, at 3:23 AM, Brian Chiu  wrote:

Hi,

I would like to be able to train and run the model on different machines.
The reason is, on my dataset, training takes around 16GB of memory and
deploying only needs 8GB.  In order to save money, it would be better
if only an 8GB memory machine is used in production, and only to start a
16GB one perhaps weekly for training.  Is this possible with
PredictionIO + Universal Recommender?

I have done some searching and found a related guide here:
https://github.com/actionml/docs.actionml.com/blob/master/pio_load_balancing.md
which copies the whole template directory and then runs pio deploy.  But
in their case HBase and Elasticsearch clusters are used.  In my case
only a single machine is used, with Elasticsearch and PostgreSQL.  Will
this work?  (I am flexible about using PostgreSQL or localfs or HBase,
but I cannot afford a cluster.)

Perhaps another solution is to make the 16GB machine a Spark slave,
start it before training starts, and have the 8GB machine connect to
it, then call pio train; pio deploy on the 8GB machine, and finally
shut down the 16GB machine.  But I have no idea if that can work.  And if
yes, is there any documentation I can look into?

Any other method is welcome!  Zero downtime is preferred but not necessary.

Thanks in advance.


Best Regards,
Brian



Re: Unable to connect to all storage backends successfully

2017-09-20 Thread Jim Miller
Yes.  I’m following this tutorial but using 0.12.0 instead of 0.10.0:
https://medium.freecodecamp.org/building-an-recommendation-engine-with-apache-prediction-io-ml-server-aed0319e0d8

-- 
Jim Miller

On September 20, 2017 at 10:51:39 AM, Pat Ferrel (p...@occamsmachete.com) wrote:

Meaning, is “firstcluster” the cluster name in your Elasticsearch configuration?


On Sep 19, 2017, at 8:54 PM, Vaghawan Ojha  wrote:

I think the problem is with Elasticsearch. Are you sure the cluster exists in 
the Elasticsearch configuration? 

On Wed, Sep 20, 2017 at 8:17 AM, Jim Miller  wrote:
Hi,

I’m using PredictionIO 0.12.0-incubating with ElasticSearch and Hbase:
PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4
PredictionIO-0.12.0-incubating/vendors/hbase-1.0.0
PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6

Everything starts with no errors, but with pio status I get:

[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6
[INFO] [Management$] Apache Spark 1.5.1 detected (meets minimum requirement of 
1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.

Connection closed 
(org.apache.predictionio.shaded.org.apache.http.ConnectionClosedException)

Dumping configuration of initialized storage backend sources.
Please make sure they are correct.

Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME -> 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4, 
HOSTS -> localhost, PORTS -> 9300, CLUSTERNAME -> firstcluster, TYPE -> 
elasticsearch

Can anyone give me an idea of what I need to fix this issue?  Here is my pio-env.sh:

# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
# SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
SPARK_HOME=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#              your Elasticsearch setup.
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-1.4.4/conf

# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO
#                  with Hadoop 2.
HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6/conf

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
#                 with HBase on a remote cluster.
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.0.0/conf

# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/

# Storage Repositories

# Default is to use PostgreSQL
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

# Storage Data Sources

# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio

# Elasticsearch Example
# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
# PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2
# Optional basic HTTP auth
# PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
# Elasticsearch 1.x Example

Re: Unable to connect to all storage backends successfully

2017-09-20 Thread Pat Ferrel
Meaning, is “firstcluster” the cluster name in your Elasticsearch configuration?
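
A quick way to check: the cluster health endpoint reports the cluster name 
(assuming ES is listening for HTTP on the default 9200; adjust host and port 
to your setup):

    curl 'http://localhost:9200/_cluster/health?pretty'
    # the reply includes "cluster_name" : "firstcluster" if it matches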


On Sep 19, 2017, at 8:54 PM, Vaghawan Ojha  wrote:

I think the problem is with Elasticsearch. Are you sure the cluster exists in 
the Elasticsearch configuration? 

On Wed, Sep 20, 2017 at 8:17 AM, Jim Miller  wrote:
Hi,

I’m using PredictionIO 0.12.0-incubating with ElasticSearch and Hbase:
PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4
PredictionIO-0.12.0-incubating/vendors/hbase-1.0.0
PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6

Everything starts with no errors, but with pio status I get:

[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6
[INFO] [Management$] Apache Spark 1.5.1 detected (meets minimum requirement of 
1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.

Connection closed 
(org.apache.predictionio.shaded.org.apache.http.ConnectionClosedException)

Dumping configuration of initialized storage backend sources.
Please make sure they are correct.

Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME -> 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4, 
HOSTS -> localhost, PORTS -> 9300, CLUSTERNAME -> firstcluster, TYPE -> 
elasticsearch

Can anyone give me an idea of what I need to fix this issue?  Here is my pio-env.sh:

# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
# SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
SPARK_HOME=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#  your Elasticsearch setup.
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-1.4.4/conf

# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO
#  with Hadoop 2.
HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6/conf

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
# with HBase on a remote cluster.
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.0.0/conf

# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/ 


# Storage Repositories

# Default is to use PostgreSQL
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

# Storage Data Sources

# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio

# Elasticsearch Example
# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
# PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2
# Optional basic HTTP auth
# PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
# Elasticsearch 1.x Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=firstcluster
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300

How to train and deploy on different machines?

2017-09-20 Thread Brian Chiu
Hi,

I would like to be able to train and run the model on different machines.
The reason is, on my dataset, training takes around 16GB of memory and
deploying only needs 8GB.  In order to save money, it would be better
if only an 8GB memory machine is used in production, and only to start a
16GB one perhaps weekly for training.  Is this possible with
PredictionIO + Universal Recommender?

I have done some searching and found a related guide here:
https://github.com/actionml/docs.actionml.com/blob/master/pio_load_balancing.md
which copies the whole template directory and then runs pio deploy.  But
in their case HBase and Elasticsearch clusters are used.  In my case
only a single machine is used, with Elasticsearch and PostgreSQL.  Will
this work?  (I am flexible about using PostgreSQL or localfs or HBase,
but I cannot afford a cluster.)

Perhaps another solution is to make the 16GB machine a Spark slave,
start it before training starts, and have the 8GB machine connect to
it, then call pio train; pio deploy on the 8GB machine, and finally
shut down the 16GB machine.  But I have no idea if that can work.  And if
yes, is there any documentation I can look into?
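
Something like the following might express that idea (an untested sketch: the 
master URL is a placeholder, and the Spark driver created by `pio train` may 
itself need training-sized memory, so the 8GB machine could still be the 
bottleneck):

    # on the 8GB machine, sending executor work to the 16GB Spark slave
    pio train -- --master spark://spark-16gb:7077 --executor-memory 14g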

Any other method is welcome!  Zero downtime is preferred but not necessary.

Thanks in advance.


Best Regards,
Brian