Thanks for your answer.

Auto scaling sounds good, but it doesn't quite fit our needs, for the following
reasons:

1. EMR's auto-scaling rules don't seem to account for task running time, as the 
cluster cannot learn this from Kylin. But build time has a hard deadline in our 
use case.
2. We want to save money by replacing some instances with the so-called Reserved 
Instances provided by AWS, which cost less but require long-term planning, as 
their billing model is not as flexible as that of on-demand instances.

We're wondering if the Kylin dev team or the community has done any related 
study. For example, suppose we already have a task that runs in 10 minutes. What 
will the build time be if the data scale grows 10 times, with no change in the 
data's distribution? And if we add more machines, how much will the build time 
be reduced?

We think this estimation is very valuable because it can guide our resource 
allocation and save money. :p
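To make the question concrete, here is a back-of-envelope sketch of the kind of estimate we have in mind. This is a hypothetical power-law model, not Kylin's actual cost model: it assumes build time scales as `data_factor ** data_exp` and shrinks as `node_factor ** node_exp`, and the exponents used below are placeholder assumptions that would have to be calibrated against real build history before being trusted.

```python
def predict_build_time(t0_min, data_factor, node_factor,
                       data_exp=1.0, node_exp=0.8):
    """Estimate a new build time from a baseline of t0_min minutes.

    data_factor: how many times larger the input data is (e.g. 10).
    node_factor: how many times larger the cluster is (e.g. 2).
    data_exp / node_exp: scaling exponents -- assumptions only;
    fit them against measured builds before relying on them.
    """
    return t0_min * (data_factor ** data_exp) / (node_factor ** node_exp)

# Example: a 10-minute build, 10x data, unchanged cluster:
print(predict_build_time(10, 10, 1))   # 100.0 under these assumptions
# Same data growth, but with twice the task nodes:
print(predict_build_time(10, 10, 2))   # somewhat above 50, since
                                       # scale-out is rarely perfectly linear
```

An article describing how Kylin's build stages actually scale would let us replace these guessed exponents with measured ones.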

Another alternative might be an article giving a more detailed description of 
Kylin's underlying algorithm. Currently it is more or less a black box for us, 
so we cannot foresee how Kylin will react when one variable changes.

On 16 Nov 2017, 10:46 AM +0800, ShaoFeng Shi <[email protected]>, wrote:
> Hi Chase,
>
> I see your Hadoop is AWS EMR; did you try EMR's auto-scaling rules? Kylin 
> builds the cube on Hadoop in parallel; if a big data set comes, Hadoop will 
> start more tasks than normal, and if there are many pending tasks, EMR can 
> detect that and add new task nodes. This should help improve overall build 
> performance, but it may not be as efficient as you expect (within 20 minutes).
>
> Is it possible to forecast that a big data set will come and then call the AWS 
> API to scale out the cluster? Besides, what's your build engine, MR or Spark? 
> You can switch to Spark to further reduce the build time.
>
>
> > 2017-11-14 16:29 GMT+08:00 Chase Zhang <[email protected]>:
> > > Hi all,
> > >
> > > This is Chase from Strikingly. Recently we ran into a problem in our 
> > > usage of Apache Kylin. Here is the description; hoping anyone here can 
> > > give some suggestions :)
> > >
> > > The problem is about estimating the resource and time cost of one cube 
> > > build in proportion to data scale.
> > >
> > > Currently we have a task that is triggered once per hour, and the cube 
> > > build takes 7-10 minutes on average. Given our business's growth, we need 
> > > to plan a scale-up of our data platform in case the build time becomes 
> > > too long.
> > >
> > > Thus, we're wondering if there is a good way to forecast the resources 
> > > required to keep the same task's build time under 20 minutes if the data 
> > > scale grows, for example, 100 times. As we are not familiar with Kylin's 
> > > underlying algorithm, we're not sure how Kylin will actually perform on 
> > > our dataset.
> > >
> > > Does the development team or other users in the community have any 
> > > experience or suggestions for this? Are there any articles on this 
> > > specific problem?
> > >
> > >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
