[4/4] kylin git commit: revise alberto's tutorial

shaofengshi Wed, 29 Mar 2017 02:08:08 -0700

revise alberto's tutorial


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/a34f34ea
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/a34f34ea
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/a34f34ea

Branch: refs/heads/document
Commit: a34f34ea6e3cf26e6bb1de36fcedaba35d2c71e5
Parents: 210a249
Author: shaofengshi <shaofeng...@apache.org>
Authored: Wed Mar 29 17:07:38 2017 +0800
Committer: shaofengshi <shaofeng...@apache.org>
Committed: Wed Mar 29 17:07:38 2017 +0800

----------------------------------------------------------------------
 website/_docs20/index.md                        |  2 +-
 .../_docs20/tutorial/cube_build_performance.md  | 55 ++++++--------------
 2 files changed, 18 insertions(+), 39 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/a34f34ea/website/_docs20/index.md
----------------------------------------------------------------------
diff --git a/website/_docs20/index.md b/website/_docs20/index.md
index b20e0a1..ada6b67 100644
--- a/website/_docs20/index.md
+++ b/website/_docs20/index.md
@@ -34,7 +34,7 @@ Tutorial
 5. [SQL reference: by Apache Calcite](http://calcite.apache.org/docs/reference.html)
 6. [Build Cube with Streaming Data](tutorial/cube_streaming.html)
 7. [Build Cube with Spark Engine (beta)](tutorial/cube_spark.html)
-8. [Kylin Cube build tuning step by step](tutorial/cube_build_performance.html)
+8. [Cube Build Tuning Step by Step](tutorial/cube_build_performance.html)
 
 
 

http://git-wip-us.apache.org/repos/asf/kylin/blob/a34f34ea/website/_docs20/tutorial/cube_build_performance.md
----------------------------------------------------------------------
diff --git a/website/_docs20/tutorial/cube_build_performance.md b/website/_docs20/tutorial/cube_build_performance.md
index bec63dc..d1375fe 100755
--- a/website/_docs20/tutorial/cube_build_performance.md
+++ b/website/_docs20/tutorial/cube_build_performance.md
@@ -1,32 +1,26 @@
 ---
 layout: docs20
-title: Kylin Cube build tuning step by step
+title: Cube Build Tuning Step by Step
 categories: tutorial
 permalink: /docs20/tutorial/cube_build_performance.html
 ---
- *This tutorial is an example step by step about how to optimize build of cube*
+ *This tutorial is an example step by step about how to optimize build of cube.* 
  
- *Thanks to ShaoFeng Shi for help*
-
-
-
-
-Try to optimize a very simple Cube, with 1 Dim and 1 Fact table (Date Dimension)
+In this scenario we're trying to optimize a very simple Cube, with 1 fact and 1 lookup table (Date Dimension). Before do a real tunning, please get an overall understanding about Cube build process from [Optimize Cube Build](/docs20/howto/howto_optimize_build.html)
 
 ![]( /images/tutorial/2.0/cube_build_performance/01.png)
 
-
 The baseline is:
 
 * One Measure: Balance, calculate always Max, Min and Count
 * All Dim_date (10 items) will be used as dimensions 
 * Input is a Hive CSV external table 
-* Output is a Cube in HBase with out compression 
+* Output is a Cube in HBase without compression 
 
 With this configuration, the results are: 13 min to build a cube of 20 Mb  (Cube_01)
 
-### Cube_02: Reduce cardinality
-To make the first improvement, use Joint and Hierarchy on Dimensions to reduce the cardinality.
+### Cube_02: Reduce combinations
+To make the first improvement, use Joint and Hierarchy on Dimensions to reduce the combinations (number of cuboids).
 
 Put together all ID and Text of: Month, Week, Weekday and Quarter using Joint Dimension
 
@@ -37,28 +31,25 @@ Define Id_date and Year as a Hierarchy Dimension
 
 This reduces the size down to 0.72MB and time to 5 min
 
-
-
 [Kylin 2149](https://issues.apache.org/jira/browse/KYLIN-2149), ideally, these Hierarchies can be defined also:
 * Id_weekday > Id_date
 * Id_Month > Id_date
 * Id_Quarter > Id_date
 * Id_week > Id_date
 
-But for now, it isn’t possible to use Joint and hierarchy together in one Dim   :(
+But for now, it impossible to use Joint and Hierarchy together for one dimension.
 
 
-### Cube_03: Compress output Cube
+### Cube_03: Compress output
 To make the next improvement, compress HBase Cube with Snappy:
 
 ![alt text](/images/tutorial/2.0/cube_build_performance/03.png)
 
-Another option is to Now we can try compress HBase Cube with Gzip:
+Another option is Gzip:
 
 ![alt text](/images/tutorial/2.0/cube_build_performance/04.png)
 
 
-
 The results of compression output are:
 
 ![alt text](/images/tutorial/2.0/cube_build_performance/05.png)
@@ -78,18 +69,17 @@ Group detailed times by concepts :
 
 67 % is used to build / process flat table and respect 30% to build the cube
 
-A lot of time is used in the first steps! 
+A lot of time is used in the first steps.
 
 This time distribution is typical in a cube with few measures and few dim (or very optimized)
 
 
-
 Try to use ORC Format and compression on Hive input table (Snappy):
 
 ![]( /images/tutorial/2.0/cube_build_performance/08.png)
 
 
-The time in the first three stree steps (Flat Table) has been improved by half  :)
+The time in the first three stree steps (Flat Table) has been improved by half.
 
 Other columnar formats can be tested:
 
@@ -99,7 +89,7 @@ Other columnar formats can be tested:
 * ORC
 * ORC compressed with Snappy
 
-But the results are worse than when using Sequence file …
+But the results are worse than when using Sequence file.
 
 See comments about this here: [Shaofengshi in MailList](http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-td6713.html#a6767)
 
@@ -107,14 +97,13 @@ The second strep is to redistribute Flat Hive table:
 
 ![]( /images/tutorial/2.0/cube_build_performance/20.png)
 
-
 Is a simple row count, two approximations can be made
 * If it doesn’t need to be accurate, the rows of the fact table can be counted – this can be performed in parallel with Step 1 (and 99% of the time it will be accurate)
 
 ![]( /images/tutorial/2.0/cube_build_performance/21.png)
 
 
-* See comments about this from Shaofengshi in MailList . In the future versions (Kylin 2265 v2.0), this steps will be implemented using Hive table statistics.
+* In the future versions (KYLIN-2165 v2.0), this steps will be implemented using Hive table statistics.
 
 
 
@@ -140,6 +129,7 @@ WHERE (ID_DATE >= '2016-12-08' AND ID_DATE < '2016-12-23')
 {% endhighlight %}
 
 The problem here, is that, Hive in only using 1 Map to create Flat Table. It is important to lets go to change this behavior. The solution is to partition DIM and FACT in the same columns
+
 * Option 1: Use id_date as a partition column on Hive table. This has a big problem: the Hive metastore is meant for few a hundred of partitions and not thousands (In [Hive 9452](https://issues.apache.org/jira/browse/HIVE-9452) there is an idea to solve this but it isn’t finished yet)
 * Option 2: Generate a new column for this purpose like Monthslot.
 
@@ -217,12 +207,11 @@ How can the performance of Map – Reduce be improved? The easy way is to increa
 * yarn.nodemanager.resource.memory-mb = 15 GB
 * yarn.scheduler.maximum-allocation-mb = 8 GB
 * yarn.nodemanager.resource.cpu-vcores = 8 cores
-With this config our max theoreticaleorical grade of parallelismelist is 8. However, t , but this has a problem: “Timed out after 3600 secs”
+With this config our max theoreticaleorical grade of parallelismelist is 8. However, but this has a problem: “Timed out after 3600 secs”
 
 ![]( /images/tutorial/2.0/cube_build_performance/26.png)
 
 
-
 The parameter mapreduce.task.timeout  (1 hour by default) define max time that Application Master (AM) can happen with out ACK of Yarn Container. Once this time passes, AM kill the container and retry the same 4 times (with the same result)
 
 Where is the problem? The problem is that 4 mappers started, but each mapper needed more than 4 GB to finish
@@ -243,7 +232,7 @@ During a normal “Build Cube” step you will see similars messages on YARN log
 ![]( /images/tutorial/2.0/cube_build_performance/28.png)
 
 
-If you don’t see this periodically, perhaps you have a bottleneck in the memory
+If you don’t see this periodically, perhaps you have a bottleneck in the memory.
 
 
 
@@ -263,8 +252,7 @@ In our case we define 3 Aggregations Groups:
 
 ![]( /images/tutorial/2.0/cube_build_performance/31.png)
 
-       
-       
+
 
 Compare without / with AGGs:
 
@@ -276,12 +264,3 @@ Now it uses 3% more of time to build the cube and 0.6% of space, but queries by
 
 
 
-
-
-**__For any suggestions, feel free to contact me__**
-
-**__Thanks, Alberto__**
-
-
-
-
