Re: Eventserver API in an Engine?

2017-09-23 Thread Pat Ferrel
And glad you did.

The needs of Heroku are just as important as those of any other user of an
Apache project *but no more so*, since one extremely important measure of TLP
eligibility is demonstrating freedom from corporate dominance.

So let me chime in with my own reasons to look at a major refactoring of PIO:
- Simplified deployment: one server with integrated Engine(s), all behind a
  single REST API in a single JVM process (perhaps identical to what Mars is
  asking for).
- No need to “train” or “deploy” on different machines, but full access to
  clustered compute and storage services (also something Mars mentions).
- Kappa and non-Spark-based Engines; a pure, clean REST API that allows GUIs
  to be plugged in; optional true security (SSL + Auth).
- The ML/AI community is moving on from Hadoop MapReduce, to Spark, to
  TensorFlow and streaming online learners (Kappa), and this requires
  independence from any specific compute backend.
- Multi-tenant, with multiple instances and types of Engines.
- Secure: TLS + authentication + authorization, but optional so there is no
  overhead when it isn’t needed.
- The CLI is just another client communicating with the server’s REST API and
  can be replaced with custom admin GUIs, for example.

We now have an MVP that delivers the above requirements, but as a replacement
for PIO. We first saw this as PIO-Kappa, and early code was named that, but
things have changed: it requires some major re-thinking, so it now has its own
name, Harness. Getting these features into PIO would require the same
re-thinking of its codebase along with a *lot* of implementation work, so we
chose to start from scratch as the easier route. The server runs as one JVM
process with REST endpoints for all input and queries, and even methods to
trigger training for Lambda Engines. We have benchmarked our scaffold Template
(a minimal operational Engine) at 6 ms/request for one user (connection) in
one thread on a 2013 MacBook Pro in localhost mode; add 1 ms for SSL + Auth.
Since it uses akka-http it will also handle a self-tuning number of parallel
requests (no benchmarks yet). Suffice it to say, it is fast.
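
To make the input and query side concrete, here is a rough client sketch in
Scala. The port, endpoint paths, engine id, and JSON fields are my own
illustrative assumptions, not the actual spec; the authoritative endpoints are
in the rest_spec.md linked below.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Illustrative only: base URL, engine id, and endpoint paths are assumptions,
// not the real Harness REST spec (see rest_spec.md for the actual contract).
object RestClientSketch {
  private val base     = "http://localhost:9090"
  private val engineId = "some_engine" // hypothetical Engine instance id
  private val client   = HttpClient.newHttpClient()

  private def post(path: String, json: String): String = {
    val req = HttpRequest.newBuilder(URI.create(base + path))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    client.send(req, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // A PIO EventServer-style event, sent unchanged to the Engine's input endpoint
    val event =
      """{"event":"purchase","entityType":"user","entityId":"u-1","targetEntityType":"item","targetEntityId":"i-1"}"""
    post(s"/engines/$engineId/events", event)

    // A query against the same Engine instance, in the same JVM process
    println(post(s"/engines/$engineId/queries", """{"user":"u-1"}"""))
  }
}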

Templates for this server are quite a bit different, because they now include
their own robust validation of input, queries, and engine.json, but also
because Templates must now do some of what PIO does. With this responsibility
comes great freedom. Freedom to use any compute backend. Freedom to use any
storage mechanism for the model or input. Freedom to be Kappa, Lambda, or any
hybrid in between. And Engines get new functionality from the server, as
listed in the requirements above.
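
To give a feel for the shape of that contract, here is a rough hypothetical
sketch; the trait and method names are mine, for illustration only, and the
real contract is in the Template package linked further down.

import scala.util.Try

// Hypothetical sketch of a Template/Engine contract; names and signatures are
// illustrative assumptions, not the actual com.actionml.core.template API.
trait EngineSketch {
  // Validate and apply engine.json-style config before serving anything.
  def init(engineJson: String): Try[Unit]

  // Validate and absorb a single input event. A Kappa Engine may update its
  // model right here; a Lambda Engine typically just persists the event.
  def input(eventJson: String): Try[Unit]

  // Optional batch training hook for Lambda Engines; a no-op for pure Kappa.
  def train(): Try[Unit] = Try(())

  // Validate the query and answer it as JSON from whatever model store the
  // Template chose to use.
  def query(queryJson: String): Try[String]
}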

Even though there are structural Template differences, they remain JSON-input
compatible with PIO. We took a PIO Template we had created in 2016 that uses
Vowpal Wabbit as a compute backend and re-implemented it in this new ML Server
as a clean Kappa Template, so we can talk about the differences with some
evidence to back up statements. There was zero change to input, so backups of
the PIO engine were moved to the new server quite easily via the CLI, with no
change to the data.

There are long, tedious discussions that could be had about how to get what
Mars and I are asking for from PIO, but Apache is a do-ocracy. All of our asks
can be done incrementally, with incremental disruption, or they can be done
all at once (and have been). There are so many trade-offs that the discussion
will, in all likelihood, never end.

I therefore suggest that Mars *do* what he thinks is needed, or alternatively,
I am willing to donate what we have running. I’m planning to make the UR a
Kappa algorithm soon, requiring no `pio train` (and no Spark). This must, of
necessity, be done on the new server framework, so whether the new framework
becomes part of PIO 2 or not is a choice for the team. I suppose I could just
push it to an “experimental” branch, but this is something I’m not willing to
*do* without some indication that it is welcome.

https://github.com/actionml/harness 
https://github.com/actionml/harness/blob/develop/commands.md 

https://github.com/actionml/harness/blob/develop/rest_spec.md 

Template contract: 
https://github.com/actionml/harness/tree/develop/rest-server/core/src/main/scala/com/actionml/core/template

The major downside I will volunteer is that Templates will require a fair bit
of work to port, and we have no Spark-based ones to use as examples yet. Also,
we have not integrated the PIO-Stores as the lead-in diagram implies. Remember,
it is an MVP running a Template in a production environment, but it makes no
effort to replicate all PIO features.

 
On Sep 22, 2017, at 6:35 PM, Mars Hall  wrote:

I'm bringing this thread back to life!

There is another thread here this week:
How to training and deploy on different machine?

In it, Pat replies:

You will have to spread the pio “workflow

Re: Unable to connect to all storage backends successfully

2017-09-23 Thread Jim Miller
Hi Donald,

Tried just now and received the following error:

vagrant:~/ $ pio status                                              [13:34:52]
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.7
[INFO] [Management$] Apache Spark 2.1.1 detected (meets minimum requirement of 
1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.

error while performing request (java.lang.RuntimeException)

Dumping configuration of initialized storage backend sources.
Please make sure they are correct.

Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME ->
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/elasticsearch-5.5.2,
HOSTS -> localhost, PORTS -> 9300, CLUSTERNAME -> firstCluster, TYPE ->
elasticsearch


HERE IS MY PIO-ENV.SH
# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
# SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
SPARK_HOME=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.7

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#              your Elasticsearch setup.
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-5.5.2
# HADOOP_CONF_DIR=/opt/hadoop

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
#                 with HBase on a remote cluster.
HBASE_CONF_DIR=$PIO_HOME/hbase-1.3.1/conf

# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/

# Storage Repositories

# Default is to use PostgreSQL
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

# Storage Data Sources

# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio

# Elasticsearch Example
# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
# PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2
# Optional basic HTTP auth
# PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
# Elasticsearch 1.x Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=firstCluster
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2

# Local File System Example
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models

# HBase Example
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase-1.3.1

# AWS S3 Example
# PIO_STORAGE_SOURCES_S3_TYPE=s3
# PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio_bucket
# PIO_STORAGE_SOURCES_S3_BASE_PATH=pio_model

ELASTICSEARCH.YML
# Elasticsearch Configuration
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make