Re: Kylin Performance

ShaoFeng Shi Mon, 26 Dec 2016 17:33:07 -0800

Alberto, I didn't test the ORC format; but as you know, Kylin consumes the
source data row by row (all columns at once), so I guess a columnar format
like ORC may not benefit much. Still, it is worth trying; if there is a better
format we can switch to it.

The "redistribute flat hive table" step adds time, but it can reduce the time
of the subsequent cube building (it avoids data skew), especially when there
are lots of records. Usually it is fast (from a couple of minutes to ten or
twenty minutes) compared to the cube build time. You mentioned it took 30% of
the total time; what is the total time, and how many input records are there?
When the input is small, the overhead may outweigh the benefit.
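
As a rough sketch of what such a redistribute step amounts to (this is not
Kylin's actual code; the rows-per-reducer value, the function names and the
table name are assumptions made up for illustration), the row count decides
how many reducers to use, and the rows are then spread randomly across them:

```python
import math

# Hypothetical target number of rows per reducer (an assumption, not a Kylin default).
ROWS_PER_REDUCER = 1_000_000


def reducer_count(row_count, rows_per_reducer=ROWS_PER_REDUCER):
    """Return how many reducers are needed so each gets about rows_per_reducer rows."""
    return max(1, math.ceil(row_count / rows_per_reducer))


def redistribute_sql(table, row_count):
    """Build illustrative Hive statements that rewrite the table evenly across reducers."""
    n = reducer_count(row_count)
    return (
        f"SET mapreduce.job.reduces={n};\n"
        f"INSERT OVERWRITE TABLE {table} "
        f"SELECT * FROM {table} DISTRIBUTE BY RAND();"
    )


if __name__ == "__main__":
    # 'intermediate_table' is a placeholder name for the flat table.
    print(redistribute_sql("intermediate_table", 2_500_000))
```

The count is what makes the step cost something: it has to finish before the
reducer number can be chosen.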

For the method you mentioned (count on the fact table, then move the
redistribute into step 1), it is actually supported in Kylin 1.5.4 (maybe
also 1.5.3) with a config parameter; but that method is not recommended as
it is unstable: in some cases (e.g., the fact table is a big Hive view, or
it is a big table but not partitioned by date), a simple "select count(*)
from fact_table" will cost lots of resources on Hadoop, and the following
"create intermediate_table as select ..." will start the same mappers again.

In contrast, the as-is method is relatively stable in extreme cases:
usually the intermediate table is much smaller than the fact table, so counting
and redistributing on it is low-cost. In the next version there will be a
further optimization (https://issues.apache.org/jira/browse/KYLIN-2165) to
reduce the time in this step.
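
To make the comparison concrete, here is a toy model that uses "rows scanned"
as a stand-in for cost. It is only a sketch under simplifying assumptions (an
unpartitioned fact table, every pass reads its whole input, nothing runs in
parallel); the function names and the numbers are hypothetical, not
measurements:

```python
def rows_scanned_as_is(fact_rows, flat_rows):
    """As-is method: build the flat table, then count and redistribute on it."""
    create_flat = fact_rows   # one pass over the fact table
    count_flat = flat_rows    # count(*) on the intermediate table
    redistribute = flat_rows  # rewrite the intermediate table
    return create_flat + count_flat + redistribute


def rows_scanned_count_first(fact_rows):
    """Alternative: count on the fact table, then build and redistribute in step 1."""
    count_fact = fact_rows    # count(*) on the fact table
    create_flat = fact_rows   # a second pass that starts the same mappers again
    return count_fact + create_flat


if __name__ == "__main__":
    fact, flat = 1_000_000_000, 50_000_000  # made-up sizes
    print("as-is:      ", rows_scanned_as_is(fact, flat))
    print("count first:", rows_scanned_count_first(fact))
```

Under this model the as-is method wins whenever the intermediate table is
smaller than half of the fact table.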

2016-12-27 1:20 GMT+08:00 Alberto Ramón <a.ramonporto...@gmail.com>:

> Hello
>
> Compared to v0, I corrected the English syntax.
>
>
> After tuning the cube:
>   -  Use a compressed Hive input table
>   -  Define Hierarchy, Joint, Dim
>   -  . . .
>
> Now: 57% is for the first steps (flat table, steps 1, 2, 3) and 43% is for
> building the cube
>
> I saw that the flat table uses SEQUENCEFILE, so I tested using
>    ORC,
>    ORC + Snappy
>    ORC + Snappy + Vectorization
>
> without good results. Any more ideas?
>
>
> I think 'Redistribute Flat Hive Table' is a simple count, yet it uses
>
> *30% of the total time*
>   Is this the normal case?
>   Could we approximate this count by the count of the fact table (which will
> be true 99% of the time) and run it in parallel with step 1? Does it need to
> be precise?
>
> 2016-12-22 14:00 GMT+01:00 Li Yang <liy...@apache.org>:
>
> > Very good work!
> >
> > Btw, we are also doing benchmarks on the SSB and TPC-H data sets, based on
> > the work below. Will share more info soon.
> >
> > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
> > - https://github.com/hortonworks/hive-testbench
> >
> >
> > Cheers
> > Yang
> >
> > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón <a.ramonporto...@gmail.com>
> > wrote:
> >
> > > When Kylin 2149 <https://issues.apache.org/jira/browse/KYLIN-2149> is
> > > solved, the performance will *improve even more*, because:
> > >
> > > You know that 2016-05-05 belongs to May, week 18, and Thursday, but
> > > Kylin doesn't know it.
> > > It will try to calculate the combinations of 2016-05-05 with January,
> > > February, March, ..., Monday, Tuesday, ..., W1, W2, ..., Q2, Q3, Q4 ==>
> > > a lot of combinations are wasted.
> > >
> > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <huangzhendon...@gmail.com>:
> > >
> > > > Great, and I agree! But I still have a question, like Alberto: why, in
> > > > an AGG, can one dim use only one rule (mandatory, joint, hierarchy)?
> > > >
> > > > --
> > > > View this message in context:
> > > > http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
> > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
> > > >
> > >
> >
>

-- 
Best regards,

Shaofeng Shi 史少锋
