Re: Question about cube size estimation in Kylin 1.5

2016-04-26 Thread ShaoFeng Shi
The issue is very likely related to
https://issues.apache.org/jira/browse/KYLIN-1624; you can wait for v1.5.2,
or cherry-pick the commits related to HLL (on the master branch) that Yang
made yesterday.


2016-04-26 17:49 GMT+08:00 ShaoFeng Shi :

> Hi Dayue,
>
> Could you please open a JIRA for this and make it configurable? As far as I
> know, Kylin now allows cube-level configurations to override
> kylin.properties; with this you can customize the magic number at the cube
> level.
>
> Thanks;
>
> 2016-04-25 15:01 GMT+08:00 Li Yang :
>
>> The magic coefficient accounts for HBase compression on keys and values: the
>> final cube size is much smaller than the sum of all keys and all values,
>> which is why we multiply by the coefficient. It's purely empirical at the
>> moment; it should vary depending on the key encoding and the compression
>> applied to the HTable.
>>
>> At the minimum, we should make it configurable, I think.
>>
>> On Mon, Apr 18, 2016 at 4:38 PM, Dayue Gao  wrote:
>>
>> > Hi everyone,
>> >
>> >
>> > I made several cubing tests on 1.5 and found that most of the time was
>> > spent on the "Convert Cuboid Data to HFile" step due to a lack of reducer
>> > parallelism. It seems that the estimated cube size is too small compared
>> > to the actual size, which leads to a small number of regions (and hence
>> > reducers) being created. The setup and results of the tests were:
>> >
>> >
>> > Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25,
>> > region_cut=5GB, #regions=2, actual_size=49GB
>> > Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05,
>> > region_cut=10GB, #regions=2, actual_size=144GB
>> >
>> >
>> > The "coefficient" is from CubeStatsReader#estimateCuboidStorageSize,
>> which
>> > looks mysterious to me. Currently the formula for cuboid size
>> estimation is
>> >
>> >
>> >   size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
>> >   where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
>> >
>> >
>> > Why do we multiply by the coefficient? And why is it five times smaller in
>> > the memory-hungry case? Could someone explain the rationale behind it?
>> >
>> >
>> > Thanks, Dayue
>
> --
> Best regards,
>
> Shaofeng Shi
>
>


-- 
Best regards,

Shaofeng Shi


Re: Question about cube size estimation in Kylin 1.5

2016-04-26 Thread ShaoFeng Shi
Hi Dayue,

Could you please open a JIRA for this and make it configurable? As far as I
know, Kylin now allows cube-level configurations to override kylin.properties;
with this you can customize the magic number at the cube level.
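
As a rough sketch of what that could look like once the coefficient is exposed
(assuming the override lives in the cube descriptor's "override_kylin_properties"
map; the property name "kylin.cube.size.estimate.ratio" is hypothetical, not an
existing setting):

  "override_kylin_properties": {
    "kylin.cube.size.estimate.ratio": "0.25"
  }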

Thanks;

2016-04-25 15:01 GMT+08:00 Li Yang :

> The magic coefficient accounts for HBase compression on keys and values: the
> final cube size is much smaller than the sum of all keys and all values,
> which is why we multiply by the coefficient. It's purely empirical at the
> moment; it should vary depending on the key encoding and the compression
> applied to the HTable.
>
> At the minimum, we should make it configurable, I think.
>
> On Mon, Apr 18, 2016 at 4:38 PM, Dayue Gao  wrote:
>
> > Hi everyone,
> >
> >
> > I made several cubing tests on 1.5 and found that most of the time was
> > spent on the "Convert Cuboid Data to HFile" step due to a lack of reducer
> > parallelism. It seems that the estimated cube size is too small compared
> > to the actual size, which leads to a small number of regions (and hence
> > reducers) being created. The setup and results of the tests were:
> >
> >
> > Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25,
> > region_cut=5GB, #regions=2, actual_size=49GB
> > Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05,
> > region_cut=10GB, #regions=2, actual_size=144GB
> >
> >
> > The "coefficient" is from CubeStatsReader#estimateCuboidStorageSize,
> which
> > looks mysterious to me. Currently the formula for cuboid size estimation
> is
> >
> >
> >   size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
> >   where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
> >
> >
> > Why do we multiply by the coefficient? And why is it five times smaller in
> > the memory-hungry case? Could someone explain the rationale behind it?
> >
> >
> > Thanks, Dayue



-- 
Best regards,

Shaofeng Shi


Question about cube size estimation in Kylin 1.5

2016-04-18 Thread Dayue Gao
Hi everyone,


I made several cubing tests on 1.5 and found that most of the time was spent on
the "Convert Cuboid Data to HFile" step due to a lack of reducer parallelism. It
seems that the estimated cube size is too small compared to the actual size,
which leads to a small number of regions (and hence reducers) being created. The
setup and results of the tests were:


Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25, 
region_cut=5GB, #regions=2, actual_size=49GB
Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05, 
region_cut=10GB, #regions=2, actual_size=144GB
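
As a back-of-envelope comparison derived from the numbers above:

  Cube#1: actual 49GB ~= 50176MB vs. estimated 8805MB -> roughly 5.7x too small;
          at a 5GB region cut, the actual size would call for ~10 regions, not 2.
  Cube#2: actual 144GB ~= 147456MB vs. estimated 4653MB -> roughly 32x too small;
          at a 10GB region cut, the actual size would call for ~15 regions, not 2.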


The "coefficient" is from CubeStatsReader#estimateCuboidStorageSize, which 
looks mysterious to me. Currently the formula for cuboid size estimation is


  size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
  where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
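
For readers who prefer code, here is a minimal Java sketch of that formula (not
the actual CubeStatsReader implementation; the class, method, and parameter
names are only illustrative):

  // Illustrative sketch only, not the real Kylin code.
  public class CuboidSizeSketch {
      // rowCount     - estimated number of rows in the cuboid
      // bytesPerRow  - estimated size of one row (rowkey plus measures), in bytes
      // memoryHungry - whether the cube has memory-hungry measures (e.g. HLL)
      public static double estimateCuboidStorageSize(long rowCount,
                                                     long bytesPerRow,
                                                     boolean memoryHungry) {
          // Empirical coefficient intended to account for HBase key/value
          // compression; memory-hungry measures are assumed to compress more.
          double coefficient = memoryHungry ? 0.05 : 0.25;
          return rowCount * bytesPerRow * coefficient;
      }
  }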


Why do we multiply by the coefficient? And why is it five times smaller in the
memory-hungry case? Could someone explain the rationale behind it?


Thanks, Dayue