The magic coefficient is due to HBase compression on keys and values: the
final cube size is much smaller than the sum of all keys and all values.
That's why we multiply by the coefficient. Its value is purely empirical at
the moment; it really should vary depending on the key encoding and the
compression applied to the HTable.
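
Roughly speaking, the estimate boils down to something like the sketch
below (class and method names are illustrative, not the actual
CubeStatsReader#estimateCuboidStorageSize code):

  // Illustrative paraphrase of the estimate; names are hypothetical.
  public class CuboidSizeSketch {

      // rows x rowSize x coefficient, converted from bytes to MB; the
      // coefficient scales the raw key+value bytes down to account for
      // HBase key/value compression.
      static double estimateCuboidSizeMB(long rows, double rowSizeBytes,
                                         boolean hasMemoryHungryMeasures) {
          double coefficient = hasMemoryHungryMeasures ? 0.05 : 0.25;
          return rows * rowSizeBytes * coefficient / (1024.0 * 1024.0);
      }

      public static void main(String[] args) {
          // e.g. 120M rows of ~100 bytes each, no memory-hungry measures
          System.out.printf("~%.0f MB%n",
                  estimateCuboidSizeMB(120_000_000L, 100, false));
      }
  }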

At the minimum, we should make it configurable, I think.
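
Something along these lines would do, though the property keys below are
just placeholders, not existing Kylin settings:

  // Hypothetical sketch of a configurable coefficient; the property keys
  // and defaults are placeholders, not existing Kylin configuration entries.
  import java.util.Properties;

  public class SizeEstimateRatio {

      static double coefficient(Properties conf, boolean hasMemoryHungryMeasures) {
          String key = hasMemoryHungryMeasures
                  ? "kylin.cube.size-estimate-memhungry-ratio"  // placeholder key
                  : "kylin.cube.size-estimate-ratio";           // placeholder key
          String fallback = hasMemoryHungryMeasures ? "0.05" : "0.25";
          return Double.parseDouble(conf.getProperty(key, fallback));
      }

      public static void main(String[] args) {
          Properties conf = new Properties();
          // operators could tune the ratio per cluster once it is configurable
          conf.setProperty("kylin.cube.size-estimate-ratio", "0.5");
          System.out.println(coefficient(conf, false)); // prints 0.5
      }
  }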

On Mon, Apr 18, 2016 at 4:38 PM, Dayue Gao <dayue_...@163.com> wrote:

> Hi everyone,
>
>
> I made several cubing tests on 1.5 and found that most of the time was spent
> on the "Convert Cuboid Data to HFile" step due to a lack of reducer
> parallelism. It seems that the estimated cube size is too small compared to
> the actual size, which leads to a small number of regions (and hence
> reducers) being created. The setup and results of the tests were:
>
>
> Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25,
> region_cut=5GB, #regions=2, actual_size=49GB
> Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05,
> region_cut=10GB, #regions=2, actual_size=144GB
>
>
> The "coefficient" is from CubeStatsReader#estimateCuboidStorageSize, which
> looks mysterious to me. Currently the formula for cuboid size estimation is
>
>
>   size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
>   where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
>
>
> Why do we multiply by the coefficient? And why is it five times smaller in
> the memory-hungry case? Could someone explain the rationale behind it?
>
>
> Thanks, Dayue
