Re: Data growth vs Cluster Size planning

2019-02-12 Thread Phillip Henry
There is too little information here to give an answer, if indeed an a
priori answer is possible at all.

However, I would do the following on your test instances:

- Run jstat -gc on all your nodes. It might be that the GC is taking a lot
of time.

- Poll with jstack semi-frequently. It can give you a fairly good idea of
where in the code the time is being spent, in a non-invasive manner.
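
To make the two checks above a bit more concrete, here is a minimal Python sketch for summarising the output once you have captured it. The function names gc_fraction and hot_frames are my own (hypothetical), and I am assuming you saved the text from something like `jstat -gc <pid> <interval>` and from repeated `jstack <pid>` runs yourself. It relies only on the GCT column of jstat (total GC time in seconds) and on the "java.lang.Thread.State" lines of a thread dump, so treat it as a rough aid rather than a proper profiler.

```python
from collections import Counter


def gc_fraction(jstat_output, interval_seconds):
    """Fraction of wall-clock time spent in GC between the first and last
    sample of a captured `jstat -gc <pid> <interval>` run.

    Assumes the first non-empty line is the header and that the samples
    were taken every `interval_seconds` seconds.
    """
    rows = [ln.split() for ln in jstat_output.splitlines() if ln.strip()]
    header, samples = rows[0], rows[1:]
    if len(samples) < 2:
        raise ValueError("need at least two jstat samples")
    gct = header.index("GCT")  # cumulative GC time, in seconds
    elapsed = interval_seconds * (len(samples) - 1)
    return (float(samples[-1][gct]) - float(samples[0][gct])) / elapsed


def hot_frames(jstack_dumps, top=5):
    """Count the top stack frame of every RUNNABLE thread across several
    jstack dumps and return the most frequent ones.

    Frames that show up in many dumps are a hint about where the time goes.
    """
    counts = Counter()
    for dump in jstack_dumps:
        lines = dump.splitlines()
        for i, line in enumerate(lines[:-1]):
            if "java.lang.Thread.State: RUNNABLE" in line:
                frame = lines[i + 1].strip()
                if frame.startswith("at "):
                    counts[frame[3:]] += 1
    return counts.most_common(top)
```

If gc_fraction comes back at more than a few percent on the executors, GC is worth looking at before resizing the cluster. If the same frames keep topping hot_frames across dumps, that is where the job is spending its time.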

Phillip



On Mon, Feb 11, 2019 at 9:48 AM Aakash Basu wrote:

> Hi,
>
> I ran a dataset of *200 columns and 0.2M records* on a cluster of *1
> master 18 GB, 2 slaves 32 GB each, 16 cores/slave*; it took around *772
> minutes* for a *very large ML tuning based job* (training).
>
> Now, my requirement is to run the *same operation on 3M records*. Any
> idea on how we should proceed? Should we go for a vertical scaling or a
> horizontal one? How should this problem be approached in a
> stepwise/systematic manner?
>
> Thanks in advance.
>
> Regards,
> Aakash.
>


Data growth vs Cluster Size planning

2019-02-11 Thread Aakash Basu
Hi,

I ran a dataset of *200 columns and 0.2M records* on a cluster of *1 master
18 GB, 2 slaves 32 GB each, 16 cores/slave*; it took around *772 minutes*
for a *very large ML tuning based job* (training).

Now, my requirement is to run the *same operation on 3M records*. Any idea
on how we should proceed? Should we go for a vertical scaling or a
horizontal one? How should this problem be approached in a
stepwise/systematic manner?

Thanks in advance.

Regards,
Aakash.