I'm using DataFrames, the types are all doubles, and I'm only extracting the
columns I need.
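
For context, the pruning looks roughly like this (a minimal sketch against the
Spark 1.6 ML API; the column names, parquet path and tree parameters below are
placeholders, not the real ones from the ported system):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.functions.col

// Hypothetical column names: read the parquet and keep only what the
// model uses, casting everything to double up front.
val needed = Seq("label", "f1", "f2", "f3")
val df = sqlContext.read.parquet("s3://bucket/training/")
  .select(needed.map(c => col(c).cast("double")): _*)

// Assemble the feature columns into the single vector column spark.ml expects.
val assembler = new VectorAssembler()
  .setInputCols(needed.filter(_ != "label").toArray)
  .setOutputCol("features")

// Example parameters only; ours come from the existing system.
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .setMaxDepth(10)

val model = rf.fit(assembler.transform(df))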

The caveat on these is that I am porting an existing system for a client,
and for their business it's likely to be cheaper to throw hardware (in AWS)
at the problem for a couple of hours than to re-engineer their algorithms.

cheers


On 7 June 2016 at 21:54, Jörn Franke <jornfra...@gmail.com> wrote:

> Before hardware optimization there is always software optimization.
> Are you using dataset / dataframe? Are you using the right data types
> (e.g. int where int is appropriate; try to avoid string and char, etc.)?
> Do you extract only the stuff needed? What are the algorithm parameters?
>
> > On 07 Jun 2016, at 13:09, Franc Carter <franc.car...@gmail.com> wrote:
> >
> >
> > Hi,
> >
> > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and
> am interested in how it might be best to scale it - e.g. more CPUs per
> instance, more memory per instance, more instances, etc.
> >
> > I'm currently using 32 m3.xlarge instances for a training set with
> 2.5 million rows, 1300 columns and a total size of 31GB (parquet).
> >
> > thanks
> >
> > --
> > Franc
>



-- 
Franc
