A Spark cluster is only needed for `pio train`. Spark must be installed on the 
machine that runs `pio deploy`, but there it is used only for its local client 
APIs and never needs to communicate with the cluster.
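
As a quick sketch (the master URL, host name, and port below are placeholders, 
not from this thread): `pio train` passes everything after `--` through to 
`spark-submit`, while `pio deploy` just starts a local query server:

    # training is submitted to the Spark cluster
    pio train -- --master spark://spark-master:7077

    # serving runs locally and never contacts the cluster
    pio deploy --port 8000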

However, the last time I checked, EMR will not work. EMR was designed for 
Hadoop MapReduce, and Spark does not use files for intermediate storage; it 
needs memory, and lots of it. Also remember that the machine that runs `pio 
train` is the Spark driver machine and needs nearly the same resources (memory 
and cores) as a Spark executor. The only way to run the driver in EMR is 
yarn-cluster mode, and the last time I checked this was either impossible or 
very difficult. So we have never been able to use EMR.
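
To make the driver-sizing point concrete, a hedged example (the memory sizes 
and master URL are illustrative only): give the driver roughly the same 
resources you give an executor by passing spark-submit flags through 
`pio train`:

    pio train -- --master spark://spark-master:7077 \
      --driver-memory 16G \
      --executor-memory 16G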

For larger installations we (ActionML) do something very similar with 
Terraform scripts. You can start all the machines for Spark, including a 
pre-configured `pio train` machine, then train, then stop them when training 
is done. This ensures you don't pay for Spark when you aren't using it.
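
A minimal sketch of that start/train/stop cycle, assuming a Terraform config 
with a `spark_cluster` module and a training host called `pio-train-host` 
(both names are hypothetical):

    # bring up the Spark executors and the pre-configured training machine
    terraform apply -target=module.spark_cluster

    # run training from the Spark driver machine
    ssh pio-train-host 'pio train -- --master spark://spark-master:7077'

    # tear the cluster down so you stop paying for it
    terraform destroy -target=module.spark_cluster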


On Jul 14, 2017, at 2:38 AM, Mattz <[email protected]> wrote:

Hello,

Is Spark required only for `pio train`, or is it needed for serving the 
recommendations as well?

I am planning to run PredictionIO on AWS, so I am thinking of running 
PredictionIO with the Elasticsearch service and EMR. I wanted to know if we 
can use EMR only during the training phase and then serve the recommendations 
from another, smaller instance running PredictionIO that talks to the 
Elasticsearch service. Is this possible?

Please let me know.

Thanks.
