Re: New document: "How to optimize cube build"

ShaoFeng Shi Mon, 13 Feb 2017 00:41:49 -0800

correct.

On Mon, Feb 13, 2017 at 3:52 PM +0800, "Ajay Chitre" <chitre.a...@gmail.com> 
wrote:

In this case, if a user runs a query whose WHERE clause has 2 dimensions 
from the "aggregation group" & 2 dimensions from the "other 5 dimensions", 
Kylin will compute the results from the base cuboid, correct? Or would it error 
out?

I can test it myself but I am being lazy -:) Looking for a quick answer from 
the experts. Thanks for your help.
On Sun, Feb 12, 2017 at 3:04 AM, ShaoFeng Shi <shaofeng...@apache.org> wrote:
Ajay,
There is no such setting, but the "aggregation group" offers something similar. 
Say the cube has 15 dimensions in total, but in the agg group you pick only 
10 of them; Kylin will then build 1 (base cuboid) + 2^10 - 1 (combinations of 
those 10 dimensions) cuboids. This way the other 5 dimensions appear only in 
the base cuboid.
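
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting rule described in this thread, not Kylin code; the function names are made up for illustration, and it assumes a single aggregation group with no hierarchy or joint dimensions.

```python
def full_cube_cuboids(n_dims: int) -> int:
    """Cuboids in a FULL cube: every non-empty combination of dimensions."""
    return 2 ** n_dims - 1


def single_group_cuboids(n_dims: int, group_size: int) -> int:
    """Cuboids when only `group_size` of the `n_dims` dimensions are in one
    aggregation group (hypothetical helper, simplified counting rule)."""
    if not 0 < group_size <= n_dims:
        raise ValueError("group_size must be between 1 and n_dims")
    if group_size == n_dims:
        # The group already contains the base cuboid as one of its combinations.
        return full_cube_cuboids(n_dims)
    # 1 base cuboid + every non-empty combination of the grouped dimensions.
    return 1 + (2 ** group_size - 1)


print(full_cube_cuboids(15))         # 32767
print(single_group_cuboids(15, 10))  # 1024
```

So under these assumptions, leaving 5 of the 15 dimensions out of the group cuts the build from 32767 cuboids to 1024.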
2017-02-09 9:20 GMT+08:00 Ajay Chitre <chitre.a...@gmail.com>:
My question was a general question. Not any specific issue that I am 
encountering -:)

I understand that we can prune by using Hierarchical dimensions, aggregation 
groups etc. But what if these types of aggregations are not possible?

Let's say I have 15 dimensions (& I can't prune any), would Kylin build 32,767 
Cuboids or is there a property to say... "If no. of dimensions is over X, stop 
building more Cuboids. Get from the base"? (Knowing this will slow down the 
queries).

Please let me know. Thanks.


On Mon, Feb 6, 2017 at 5:43 AM, ShaoFeng Shi <shaofeng...@gmail.com> wrote:
Ajay, thanks for your feedback.
For question 1, the code has been merged into the master branch; the next 
release will be 2.0, and a beta release will be published soon.
For question 2, yes, your understanding is correct: an N-dimension FULL cube 
has 2^N - 1 cuboids. But if you use hierarchy dimensions, joint dimensions, or 
split the dimensions into multiple aggregation groups, it becomes a "partial" 
cube, which means some cuboids are pruned.
If a query uses dimensions across aggregation groups, then only the base cuboid 
can fulfill it; Kylin has to do post-aggregation from the base cuboid, and 
performance degrades. Please check whether this is the case on your side.
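
To illustrate the fallback to the base cuboid, here is a toy model in Python. It is not Kylin's actual cuboid-matching logic: `pick_cuboid` and the dimension names `d1`..`d15` are hypothetical, and the model assumes each aggregation group materializes every non-empty combination of its own dimensions, with nothing else built except the base cuboid.

```python
from typing import FrozenSet, Iterable, List


def pick_cuboid(query_dims: Iterable[str],
                agg_groups: List[FrozenSet[str]],
                all_dims: FrozenSet[str]) -> FrozenSet[str]:
    """Return the dimension set of the cuboid that serves the query
    (toy model, hypothetical helper)."""
    query = frozenset(query_dims)
    if not query:
        raise ValueError("query must use at least one dimension")
    if not query <= all_dims:
        raise ValueError(f"unknown dimensions: {sorted(query - all_dims)}")
    for group in agg_groups:
        if query <= group:
            # All query dimensions sit in one group, so the exact cuboid exists.
            return query
    # Dimensions span groups (or fall outside them): only the base cuboid
    # holds them all, so results must be post-aggregated from it.
    return all_dims


ALL_DIMS = frozenset(f"d{i}" for i in range(1, 16))  # 15 dimensions
GROUP = frozenset(f"d{i}" for i in range(1, 11))     # 10 of them in one group

inside = pick_cuboid({"d1", "d2"}, [GROUP], ALL_DIMS)
mixed = pick_cuboid({"d1", "d2", "d11", "d12"}, [GROUP], ALL_DIMS)
print(sorted(inside))     # ['d1', 'd2']
print(mixed == ALL_DIMS)  # True
```

In this model a query mixing grouped and ungrouped dimensions does not fail; it is simply answered from the base cuboid, at a higher aggregation cost.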

On Mon, Feb 6, 2017 at 2:05 PM +0900, "Ajay Chitre" <chitre.a...@gmail.com> 
wrote:

Thanks for writing this document. It's very helpful. I have the following questions:

1) Doc says... "Kylin will build dictionaries in memory (in next version this 
will be moved to MR)".

Which version can we expect this in? For large Cubes this process takes a long 
time on the local machine. We really need to move this to the Hadoop cluster. In 
fact, it would be great if we could have an option to run this under Spark -:) 

2) About the "Build N-Dimension Cuboid" step.

Does Kylin build ALL Cuboids? My understanding is:

Total no. of Cuboids = (2 to the power of # of dimensions) - 1

Correct?

So if there are 7 dimensions, there will be 127 Cuboids, right? Does Kylin 
create ALL of them?

I was under the impression that, after some point, Kylin will just get measures 
from the Base Cuboid; instead of building all of them. Please explain.

Thanks.



On Sat, Feb 4, 2017 at 2:19 AM, Li Yang <liy...@apache.org> wrote:
Feel free to update the document with different opinions. :-)

On Thu, Jan 26, 2017 at 11:34 AM, ShaoFeng Shi <shaofeng...@apache.org> wrote:
Hi Alberto,
Thanks for your comments! In many cases the data is imported into Hadoop in T+1 
mode. Especially when each day's data is tens of GB, it is reasonable to 
partition the Hive table by date. The question is whether it is worth keeping a 
long history of data in Hive. Usually users keep only a couple of months' data 
in Hive; if the number of partitions exceeds Hive's threshold, they can easily 
remove the oldest partitions or move them to another table. That is a common 
Hive practice, I think, and it is very good to know that Hive 2.0 will solve 
this. 
2017-01-25 17:10 GMT+08:00 Alberto Ramón <a.ramonporto...@gmail.com>:
Be careful about partitioning by "FLIGHTDATE".

From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance

"Option 1: Use id_date as partition column on Hive table. This have a big
 problem: the Hive metastore is meant for few hundred of partitions not 
thousand (Hive 9452 there is an idea to solve this isn’t in progress)"

Hive 2.0 will include a preview (for testing only) of a solution to this.

2017-01-25 9:46 GMT+01:00 ShaoFeng Shi <shaofeng...@apache.org>:
Hello,
A new document has been added on cube build practices. Any suggestions or 
comments are welcome. We can update the doc later based on feedback.
Here is the link: https://kylin.apache.org/docs16/howto/howto_optimize_build.html

-- 
Best regards,
Shaofeng Shi 史少锋