Re: Question about cube size estimation in Kylin 1.5
The issue is very likely related to https://issues.apache.org/jira/browse/KYLIN-1624. You can wait for v1.5.2, or pick the commits related to HLL (on the master branch) made by Yang yesterday.

--
Best regards,

Shaofeng Shi
Re: Question about cube size estimation in Kylin 1.5
Hi Dayue,

Could you please open a JIRA for this and make it configurable? As I know, Kylin now allows cube-level configurations to override kylin.properties; with this you can customize the magic number at the cube level.

Thanks,

2016-04-25 15:01 GMT+08:00 Li Yang:

> The magic coefficient is due to HBase compression on keys and values; the
> final cube size is much smaller than the sum of all keys and all values.
> That's why we multiply by the coefficient. It's totally by experience at
> the moment. It should vary depending on the key encoding and compression
> applied to the HTable.
>
> At a minimum, we should make it configurable, I think.

--
Best regards,

Shaofeng Shi
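A minimal sketch of the suggested change, assuming hypothetical property names ("kylin.cube.size.estimate.ratio" and "kylin.cube.size.estimate.memhungry.ratio" are illustrative stand-ins, not actual Kylin config keys), with the current magic numbers kept as the defaults:

    import java.util.Properties;

    public class SizeRatioConfig {
        // Hypothetical property keys, for illustration only.
        static final String RATIO_KEY = "kylin.cube.size.estimate.ratio";
        static final String MEMHUNGRY_RATIO_KEY = "kylin.cube.size.estimate.memhungry.ratio";

        // Resolves the estimation coefficient from configuration (which a
        // cube-level override could supply), falling back to the current
        // hard-coded values when the property is absent.
        static double sizeRatio(Properties conf, boolean memoryHungry) {
            String key = memoryHungry ? MEMHUNGRY_RATIO_KEY : RATIO_KEY;
            String dft = memoryHungry ? "0.05" : "0.25";
            return Double.parseDouble(conf.getProperty(key, dft));
        }
    }

With something like this in place, a cube whose data compresses poorly could simply raise the ratio at the cube level instead of living with the hard-coded 0.25.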
Question about cube size estimation in Kylin 1.5
Hi everyone,

I made several cubing tests on 1.5 and found that most of the time was spent on the "Convert Cuboid Data to HFile" step due to a lack of reducer parallelism. It seems that the estimated cube size is too small compared to the actual size, which leads to a small number of regions (and hence reducers) being created. The setup and results of the tests were:

Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25, region_cut=5GB, #regions=2, actual_size=49GB
Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05, region_cut=10GB, #regions=2, actual_size=144GB

The "coefficient" comes from CubeStatsReader#estimateCuboidStorageSize, which looks mysterious to me. Currently the formula for cuboid size estimation is:

size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25

Why do we multiply by the coefficient? And why is it five times smaller in the memory-hungry case? Could someone explain the rationale behind it?

Thanks, Dayue
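For concreteness, here is a minimal sketch of the estimation formula quoted above (a simplification of what CubeStatsReader#estimateCuboidStorageSize does, not the actual implementation; rowCount and rowSizeBytes are assumed to come from the cube statistics):

    public class CuboidSizeEstimate {
        // rowCount: estimated row count of the cuboid (from cube statistics)
        // rowSizeBytes: estimated bytes per row (encoded row key plus all measures)
        static double estimateCuboidStorageMB(long rowCount, int rowSizeBytes,
                                              boolean hasMemoryHungryMeasures) {
            double coefficient = hasMemoryHungryMeasures ? 0.05 : 0.25;
            return rowCount * rowSizeBytes * coefficient / (1024.0 * 1024.0);
        }
    }

Plugging in Cube#1's numbers shows the scale of the problem: an estimate of 8805MB at coefficient 0.25 implies roughly 8805 / 0.25 ≈ 34GB of raw key/value bytes, yet the actual size was 49GB, so the fixed ratio underestimated the final HFiles by about 5.7x.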