I'm using DataFrames, the types are all doubles, and I'm only extracting what I need.
The caveat on these is that I am porting an existing system for a client, and for their business it's likely to be cheaper to throw hardware (in AWS) at the problem for a couple of hours than to re-engineer their algorithms.

cheers

On 7 June 2016 at 21:54, Jörn Franke <jornfra...@gmail.com> wrote:

> Before hardware optimization there is always software optimization.
> Are you using dataset / dataframe? Are you using the right data types
> (e.g. int where int is appropriate; try to avoid string and char, etc.)?
> Do you extract only the stuff needed? What are the algorithm parameters?
>
> On 07 Jun 2016, at 13:09, Franc Carter <franc.car...@gmail.com> wrote:
>
> > Hi,
> >
> > I am training a RandomForest Regression model on Spark-1.6.1 (EMR) and
> > am interested in how it might be best to scale it - e.g. more CPUs per
> > instance, more memory per instance, more instances, etc.
> >
> > I'm currently using 32 m3.xlarge instances for a training set with
> > 2.5 million rows, 1300 columns and a total size of 31GB (parquet).
> >
> > thanks
> >
> > --
> > Franc

--
Franc
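To make the discussion concrete, here is a minimal Scala sketch of how a RandomForest regression job like the one described above could be set up with spark.ml on Spark 1.6. The S3 path, label column name, and tree parameters below are assumptions for illustration only; they are not taken from the actual system in the thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.RandomForestRegressor

object RandomForestTraining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-regression"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical S3 path and label column -- the real schema is not shown in the thread.
    // The data set described above is ~2.5 million rows x 1300 double columns, 31GB of Parquet.
    val df = sqlContext.read.parquet("s3://my-bucket/training/")

    // Keep only the columns the model needs: every double column except the label.
    val featureCols = df.columns.filter(_ != "label")

    // spark.ml estimators take a single vector column, so assemble the doubles into one.
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    // Placeholder values for the algorithm parameters asked about above;
    // numTrees, maxDepth and subsamplingRate largely determine training cost.
    val rf = new RandomForestRegressor()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(100)
      .setMaxDepth(10)
      .setSubsamplingRate(0.7)

    // Fit the pipeline; this is the step being scaled across the 32 m3.xlarge nodes.
    val model = new Pipeline().setStages(Array(assembler, rf)).fit(df)

    sc.stop()
  }
}

More and deeper trees grow training time quickly, so those parameters are usually the first lever to look at before adding instances, memory or cores.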