This is an automated email from the ASF dual-hosted git repository. nic pushed a commit to branch document in repository https://gitbox.apache.org/repos/asf/kylin.git
The following commit(s) were added to refs/heads/document by this push: new 443e7b0 Update configuration page for 3.0.0 realtime olap. 443e7b0 is described below commit 443e7b0c511ab3a8044e9d7ece72fa24835a0d75 Author: XiaoxiangYu <hit_la...@126.com> AuthorDate: Tue Dec 24 17:05:49 2019 +0800 Update configuration page for 3.0.0 realtime olap. --- website/_data/docs.yml | 1 + website/_docs/install/configuration.cn.md | 70 +++++---- website/_docs/install/configuration.md | 37 +++-- .../lambda_mode_and_timezone_realtime_olap.md | 175 +++++++++++++++++++++ website/_docs/tutorial/real_time_olap.md | 5 +- website/_docs30/tutorial/real_time_olap.md | 1 + website/download/index.cn.md | 2 +- website/download/index.md | 2 +- website/images/RealtimeOlap/Before-Submit.png | Bin 0 -> 357148 bytes .../images/RealtimeOlap/CreateStreamingModel.png | Bin 0 -> 39540 bytes website/images/RealtimeOlap/JobMonitor.png | Bin 0 -> 594207 bytes website/images/RealtimeOlap/LambdaCubeSegment.png | Bin 0 -> 176723 bytes website/images/RealtimeOlap/Table-Meta-1.png | Bin 0 -> 136491 bytes website/images/RealtimeOlap/Table-Meta-2.png | Bin 0 -> 181488 bytes website/images/RealtimeOlap/Table-Meta-3.png | Bin 0 -> 42778 bytes .../images/RealtimeOlap/Timezone-checkresult.png | Bin 0 -> 167890 bytes 16 files changed, 247 insertions(+), 46 deletions(-) diff --git a/website/_data/docs.yml b/website/_data/docs.yml index d3ada2a..5c99520 100644 --- a/website/_data/docs.yml +++ b/website/_data/docs.yml @@ -52,6 +52,7 @@ - tutorial/setup_jdbc_datasource - tutorial/hybrid - tutorial/mysql_metastore + - tutorial/lambda_mode_and_timezone_realtime_olap - title: Integration docs: diff --git a/website/_docs/install/configuration.cn.md b/website/_docs/install/configuration.cn.md index bbb9905..8099c6a 100644 --- a/website/_docs/install/configuration.cn.md +++ b/website/_docs/install/configuration.cn.md @@ -567,35 +567,47 @@ Kylin 可以使用三种类型的压缩,分别是 HBase 表压缩,Hive 输 ### 实时 OLAP {#realtime-olap} -- 
`kylin.stream.job.dfs.block.size`:指定了流式构建 Base Cuboid 任务所需 HDFS 块的大小。默认值为 *16M*。 -- `kylin.stream.index.path`:指定了本地 segment 缓存的位置。默认值为 *stream_index*。 -- `kylin.stream.cube-num-of-consumer-tasks`:指定了共享同一个 topic 分区的 replica set 数量,影响着不同 replica set 分配的分区数量。默认值为 *3*。 -- `kylin.stream.cube.window`:指定了每个 segment 的持续时长,以秒为单位。默认值为 *3600*。 -- `kylin.stream.cube.duration`:指定了 segment 从 active 状态变为 IMMUTABLE 状态的等待时间,以秒为单位。默认值为 *7200*。 -- `kylin.stream.cube.duration.max`:segment 的 active 状态的最长持续时间,以秒为单位。默认值为 *43200*。 -- `kylin.stream.checkpoint.file.max.num`:指定了每个 Cube 包含的 checkpoint 文件数的最大值。默认值为 *5*。 -- `kylin.stream.index.checkpoint.intervals`:指定了两个 checkpoint 设置的时间间隔。默认值为 *300*。 -- `kylin.stream.index.maxrows`:指定了缓存在堆/内存中的事件数的最大值。默认值为 *50000*。 -- `kylin.stream.immutable.segments.max.num`:指定了当前 receiver 里每个 Cube 中状态为 IMMUTABLE 的 segment 的最大数值,如果超过最大值,当前 topic 的消费将会被暂停。默认值为 *100*。 -- `kylin.stream.consume.offsets.latest`:是否从最近的偏移量开始消费。默认值为 *true*。 -- `kylin.stream.node`:指定了 coordinator/receiver 的节点。形如 host:port。默认值为 *null*。 -- `kylin.stream.metadata.store.type`:指定了元数据存储的位置。默认值为 *zk*。 -- `kylin.stream.segment.retention.policy`:指定了当 segment 变为 IMMUTABLE 状态时,本地 segment 缓存的处理策略。参数值可选 `purge` 和 `fullBuild`。`purge` 意味着当 segment 的状态变为 IMMUTABLE,本地缓存的 segment 数据将被删除。`fullBuild` 意味着当 segment 的状态变为 IMMUTABLE,本地缓存的 segment 数据将被上传到 HDFS。默认值为 *fullBuild*。 -- `kylin.stream.assigner`:指定了用于将 topic 分区分配给不同 replica set 的实现类。该类实现了 `org.apache.kylin.stream.coordinator.assign.Assigner` 类。默认值为 *DefaultAssigner*。 -- `kylin.stream.coordinator.client.timeout.millsecond`:指定了连接 coordinator 客户端的超时时间。默认值为 *5000*。 -- `kylin.stream.receiver.client.timeout.millsecond`:指定了连接 receiver 客户端的超时时间。默认值为 *5000*。 -- `kylin.stream.receiver.http.max.threads`:指定了连接 receiver 的最大线程数。默认值为 *200*。 -- `kylin.stream.receiver.http.min.threads`:指定了连接 receiver 的最小线程数。默认值为 *10*。 -- `kylin.stream.receiver.query-core-threads`:指定了当前 receiver 用于查询的线程数。默认值为 *50*。 -- `kylin.stream.receiver.query-max-threads`:指定了当前 receiver 
用于查询的最大线程数。默认值为 *200*。 -- `kylin.stream.receiver.use-threads-per-query`:指定了每个查询使用的线程数。默认值为 *8*。 -- `kylin.stream.build.additional.cuboids`:是否构建除 Base Cuboid 外的 cuboids。除 Base Cuboid 外的 cuboids 指的是在 Cube 的 Advanced Setting 页面选择的强制维度的聚合。默认值为 *false*。默认只构建 Base Cuboid。 -- `kylin.stream.segment-max-fragments`:指定了每个 segment 保存的最大 fragment 数。默认值为 *50*。 -- `kylin.stream.segment-min-fragments`:指定了每个 segment 保存的最小 fragment 数。默认值为 *15*。 -- `kylin.stream.max-fragment-size-mb`:指定了每个 fragment 文件的最大尺寸。默认值为 *300*。 -- `kylin.stream.fragments-auto-merge-enable`:是否开启 fragment 文件自动合并的功能。默认值为 *true*。 - -> 提示:更多信息请参考 [Real-time OLAP](http://kylin.apache.org/docs30/tutorial/real_time_olap.html)。 + +#### 全局设置 + +- `kylin.stream.job.dfs.block.size`: 指定了流式构建 Cuboid 任务所需 HDFS 块的大小。默认值为 *16M*。 +- `kylin.stream.index.path`: 指定了存储segment cache file的本地路径(包括本地fragment file和checkpoint file)。支持相对路径和绝对路径,默认值是 *stream_index*,也就是写到`$KYLIN_HOME/stream_index`,如果数据量很大的话将会占用大量磁盘空间,您也可以根据您的需求写成绝对路径以将数据放到数据盘。 +- `kylin.stream.node`: 指定了 receiver/coordinator的地址。格式应该为`hostname:port`或者`port`。如果设置成`port`,Kylin将会自动补全hostname;如果不设置该属性,将会使用默认的端口(Coordinator:7070,Receiver:9090)。当进程启动时,会将自身注册到Metadata。 +- `kylin.stream.metadata.store.type`: 指定了Realtime集群信息的元数据存储。默认值是 *zk*。 +- `kylin.stream.receiver.use-threads-per-query`: 指定了每个查询使用的线程资源数量。默认值是*8*。 + +#### Cube 级别设置 + +- `kylin.stream.index.maxrows`: 指定了缓存在堆内的聚合后的事件最大行数。默认值是*50000*。这个参数会影响Fragment File的数量,可以根据需求适当调高。 +- `kylin.stream.cube-num-of-consumer-tasks`: 指定了一个topic的全部消息的摄入将由哪多少Replica Set来负责。如果您的消息速率较大,需要适当提升这个数值。默认值是*3*。 +- `kylin.stream.segment.retention.policy`: 当Segment状态变为*IMMUTABLE*,该配置指定了Receiver如何处理本地Segment Cache。可选值包含`purge`和`fullBuild`。设置为`purge`后,Receiver会等待一定时间后删除本地数据;设置为`fullBuild`后,数据会上传到HDFS并等待构建。默认值是*fullBuild*。 +- `kylin.stream.build.additional.cuboids`: 默认情况下Receiver只构建base cuboid来回答查询,可以在Receiver端是否构建额外的cuboid,如果你希望优化某些查询的响应时间。具体哪些额外的Cuboid需要被构建由高级配置页面的强制Cuboid指定。 +- `kylin.stream.cube.window`: 指定了Streaming 
Segment的长度。默认值是*3600*。详情参阅[deep-dive-real-time-olap](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/)。 +- `kylin.stream.cube.duration`: 指定了Streaming Segment会等待迟到的消息多久,默认值是 *7200*(秒)。 详情参阅[deep-dive-real-time-olap](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/)。 +- `kylin.stream.cube.duration.max`: 指定了Streaming Segment保持Active的最长时间。默认值是 *43200*。详情参阅[deep-dive-real-time-olap](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/)。 +- `kylin.stream.checkpoint.file.max.num`: 指定了Receiver为每一个Cube保留的checkpoint文件数量。默认值是 *5*。 +- `kylin.stream.index.checkpoint.intervals`: 指定了Receiver进行checkpoint的间隔。默认值是 *300*。 +- `kylin.stream.immutable.segments.max.num`: 指定了在Receiver端,一个Cube最多可以保持多少个*IMMUTABLE*segment,因为Receiver端的性能和Fragment File的数量呈负相关。默认值是 *100*。 +- `kylin.stream.consume.offsets.latest`:指定了Receiver从什么位置开始消费,设置成*true*则从最新的offset开始消费,false则从最老的位置消费。默认值是 *true*。 + +#### 高级设置 + +- `kylin.stream.assigner`: 值是一个类的名字,这个类应该是`org.apache.kylin.stream.coordinator.assign.Assigner`的实现类,用于指定如何将Kafka Topic 下的各个Partition分配给各个Replica Set。默认值是 *DefaultAssigner*,其策略会努力将工作负载分配给负责partition数量少的Replica Set,以使得各个Replica Set工作负载相对均衡。 +- `kylin.stream.coordinator.client.timeout.millsecond`: 指定和Coordinator HTTP连接的Timeout,默认值是 *5000*。 +- `kylin.stream.receiver.client.timeout.millsecond`:指定和Receiver HTTP连接的Timeout,默认值是 *5000*。 +- `kylin.stream.receiver.http.max.threads`: 指定了Receiver端的Http连接最大线程数。默认值为 *200*。 +- `kylin.stream.receiver.http.min.threads`: 指定了Receiver端的Http连接最小线程数。默认值为 *10*。 +- `kylin.stream.receiver.query-core-threads`: 指定了Receiver用于scan的线程数量,默认值是*50*。 +- `kylin.stream.receiver.query-max-threads`: 指定了Receiver用于scan的线程最大数量,默认值是*200*。 +- `kylin.stream.segment-max-fragments`: Receiver端每次MemoryStore大小达到阈值(`kylin.stream.index.maxrows`),会落盘形成一个Fragment File,Receiver会尝试尽可能合并这些Fragment File来减少数据冗余。这个配置项会指定触发merge的阈值,默认值是*50*。 +- `kylin.stream.segment-min-fragments`: Receiver端的每次merge后不会使文件数量少于这个阈值,默认值是 *15*。 +- 
`kylin.stream.max-fragment-size-mb`: 合并后,每个Fragment File的大小不会超过该值,默认值是 *300*。 +- `kylin.stream.fragments-auto-merge-enable`: 是否开启后台自动合并Fragment File。默认值是 *true*。 +- `kylin.stream.metrics.option`: 指定是否开启Receiver端的metrics信息收集, 可选值是 csv/console/jmx。 +- `kylin.stream.event.timezone`: 指定从Event Time衍生出来的时间衍生列如`HOUR_START`/`DAY_START`使用哪种时区,默认是UTC时间。 +- `kylin.stream.auto-resubmit-after-discard-enabled`: 当用户 discard了某一个 Realtime的构建任务,是否自动重新提交新任务。 + +> 提示:入门教程 请参考 [Real-time OLAP](/docs/tutorial/realtime_olap.html)。 diff --git a/website/_docs/install/configuration.md b/website/_docs/install/configuration.md index d8c5e1f..8e61c6d 100644 --- a/website/_docs/install/configuration.md +++ b/website/_docs/install/configuration.md @@ -565,20 +565,30 @@ This compression is configured via `kylin_job_conf.xml` and `kylin_job_conf_inme ### Real-time OLAP {#realtime-olap} +#### Global level config + - `kylin.stream.job.dfs.block.size`: specifies the HDFS block size of the streaming Base Cuboid job using. The default value is *16M*. -- `kylin.stream.index.path`: specifies the path to store local segment cache. The default value is *stream_index*. +- `kylin.stream.index.path`: specifies the local path to store segment cache files(including fragment and checkpoint files). The default value is *stream_index*. +- `kylin.stream.node`: specifies the node of coordinator/receiver. Value should be `hostname:port` or `port`. If set to `port`, Kylin will complete hostname automatically. When Kylin process started, it will register it into metadata. The default value is *null*. +- `kylin.stream.metadata.store.type`: specifies the position of metadata store. The default value is *zk*. This entry is trivial because it has only one option. +- `kylin.stream.receiver.use-threads-per-query`: specifies the threads number that each query use. The default value is *8*. + +#### Cube level config + +- `kylin.stream.index.maxrows`: specifies the maximum number of the aggregated event keep in JVM heap. 
The default value is *50000*. Consider increasing it if you have enough heap size.
 - `kylin.stream.cube-num-of-consumer-tasks`: specifies the number of replica sets that share the whole topic partition. It affects how many partitions will be assigned to different replica sets. The default value is *3*.
-- `kylin.stream.cube.window`: specifies the length of duration of each segment, value in seconds. The default value is *3600*.
-- `kylin.stream.cube.duration`: specifies the wait time that a segment's status changes from active to IMMUTABLE, value in seconds. The default value is *7200*.
-- `kylin.stream.cube.duration.max`: specifies the maximum duration that segment can keep active, value in seconds. The default value is *43200*.
+- `kylin.stream.segment.retention.policy`: specifies how the local segment cache is handled when a segment becomes *IMMUTABLE*. Optional values include `purge` and `fullBuild`. `purge` means that when the segment becomes *IMMUTABLE*, it will be deleted. `fullBuild` means that when the segment becomes *IMMUTABLE*, it will be uploaded to HDFS. The default value is *fullBuild*.
+- `kylin.stream.build.additional.cuboids`: whether to build additional Cuboids. The additional Cuboids are the aggregations of the Mandatory Dimensions chosen on the *Cube Advanced Setting* page. The default value is *false*, so only the Base Cuboid is built by default. Consider enabling it if you care about QPS and most query patterns can be foreseen.
+- `kylin.stream.cube.window`: specifies the duration of each segment, in seconds. The default value is *3600*. For details, see [deep-dive-real-time-olap](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/).
+- `kylin.stream.cube.duration`: specifies how long a segment waits before its status changes from active to IMMUTABLE, in seconds. The default value is *7200*. For details, see [deep-dive-real-time-olap](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/).
+- `kylin.stream.cube.duration.max`: specifies the maximum duration that a segment can stay active, in seconds. The default value is *43200*. For details, see [deep-dive-real-time-olap](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/).
 - `kylin.stream.checkpoint.file.max.num`: specifies the maximum number of checkpoint file for each cube. The default value is *5*.
 - `kylin.stream.index.checkpoint.intervals`: specifies the time interval between setting two checkpoints. The default value is *300*.
-- `kylin.stream.index.maxrows`: specifies the maximum number of the entered event be cached in heap/memory. The default value is *50000*.
 - `kylin.stream.immutable.segments.max.num`: specifies the maximum number of the IMMUTABLE segment in each Cube of the current streaming receiver, if exceed, consumption of current topic will be paused. The default value is *100*.
-- `kylin.stream.consume.offsets.latest`: whether to consume from the latest offset. The default value is *true*.
-- `kylin.stream.node`: specifies the node of coordinator/receiver. Such as host:port. The default value is *null*.
-- `kylin.stream.metadata.store.type`: specifies the position of metadata store. The default value is *zk*.
-- `kylin.stream.segment.retention.policy`: specifies the strategy to process local segment cache when segment become IMMUTABLE. Optional values include `purge` and `fullBuild`. `purge` means when the segment become IMMUTABLE, it will be dropped. `fullBuild` means when the segment become IMMUTABLE, it will be uploaded to HDFS. The default value is *fullBuild*.
+- `kylin.stream.consume.offsets.latest`: whether to consume from the latest offset (*true*) or from the earliest offset (*false*). The default value is *true*.
+
+#### Advanced config
+
 - `kylin.stream.assigner`: specifies the implementation class which used to assign the topic partition to different replica sets. The class should be the implementation class of `org.apache.kylin.stream.coordinator.assign.Assigner`.
The default value is *DefaultAssigner*. - `kylin.stream.coordinator.client.timeout.millsecond`: specifies the connection timeout of the coordinator client. The default value is *5000*. - `kylin.stream.receiver.client.timeout.millsecond`: specifies the connection timeout of the receiver client. The default value is *5000*. @@ -586,14 +596,15 @@ This compression is configured via `kylin_job_conf.xml` and `kylin_job_conf_inme - `kylin.stream.receiver.http.min.threads`: specifies the minimum connection threads of the receiver. The default value is *10*. - `kylin.stream.receiver.query-core-threads`: specifies the number of query threads be used for the current streaming receiver. The default value is *50*. - `kylin.stream.receiver.query-max-threads`: specifies the maximum number of query threads be used for the current streaming receiver. The default value is *200*. -- `kylin.stream.receiver.use-threads-per-query`: specifies the threads number that each query use. The default value is *8*. -- `kylin.stream.build.additional.cuboids`: whether to build additional Cuboids. The additional Cuboids mean the aggregation of Mandatory Dimensions that chosen in Cube Advanced Setting page. The default value is *false*. Only build Base Cuboid by default. - `kylin.stream.segment-max-fragments`: specifies the maximum number of fragments that each segment keep. The default value is *50*. - `kylin.stream.segment-min-fragments`: specifies the minimum number of fragments that each segment keep. The default value is *15*. - `kylin.stream.max-fragment-size-mb`: specifies the maximum size of each fragment. The default value is *300*. -- `kylin.stream.fragments-auto-merge-enable`: whether to enable fragments auto merge. The default value is *true*. +- `kylin.stream.fragments-auto-merge-enable`: whether to enable fragments auto merge in streaming receiver side. The default value is *true*. 
+- `kylin.stream.metrics.option`: specifies how metrics are reported on the streaming receiver side; the option values are csv/console/jmx.
+- `kylin.stream.event.timezone`: specifies which timezone derived time columns such as `HOUR_START`/`DAY_START` use.
+- `kylin.stream.auto-resubmit-after-discard-enabled`: whether to automatically resubmit a new build job when the previous job has been discarded by the user.
-> Note: For more information, please refer to the [Real-time OLAP](http://kylin.apache.org/docs30/tutorial/real_time_olap.html).
+> Note: For a step-by-step tutorial, please refer to [Real-time OLAP](/docs/tutorial/realtime_olap.html).
 ### Storage Clean up Configuration {#storage-clean-up-configuration}
diff --git a/website/_docs/tutorial/lambda_mode_and_timezone_realtime_olap.md b/website/_docs/tutorial/lambda_mode_and_timezone_realtime_olap.md
new file mode 100644
index 0000000..3f22997
--- /dev/null
+++ b/website/_docs/tutorial/lambda_mode_and_timezone_realtime_olap.md
@@ -0,0 +1,175 @@
+---
+layout: docs
+title: Lambda mode and Timezone in Real-time OLAP
+categories: tutorial
+permalink: /docs/tutorial/lambda_mode_and_timezone_realtime_olap.html
+---
+
+Kylin v3.0.0 releases the real-time OLAP feature. Powered by the newly added streaming receiver cluster, Kylin can query streaming data with sub-second latency. You can check [this tech blog](/blog/2019/04/12/rt-streaming-design/) for the overall design and core concepts.
+
+If you are looking for a step-by-step tutorial, please check [this tutorial](/docs/tutorial/realtime_olap.html).
+In this article, we introduce how to update segments and how to set the timezone for derived time columns in a real-time OLAP cube.
+
+# Background
+
+Say we have a Kafka message that looks like this:
+
+{% highlight Groff markup %}
+{
+    "s_nation":"SAUDI ARABIA",
+    "lo_supplycost":74292,
+    "p_category":"MFGR#0910",
+    "local_day_hour_minute":"09_21_44",
+    "event_time":"2019-12-09 08:44:50.000-0500",
+    "local_day_hour":"09_21",
+    "lo_quantity":12,
+    "lo_revenue":1411548,
+    "p_brand":"MFGR#0910051",
+    "s_region":"MIDDLE EAST",
+    "lo_discount":5,
+    "customer_info":{
+        "CITY":"CHINA 057",
+        "REGION":"ASIA",
+        "street":"CHINA 05721",
+        "NATION":"CHINA"
+    },
+    "d_year":1994,
+    "d_weeknuminyear":30,
+    "p_mfgr":"MFGR#09",
+    "v_revenue":7429200,
+    "d_yearmonth":"Jul1994",
+    "s_city":"SAUDI ARA15",
+    "profit_ratio":0.05263157894736842,
+    "d_yearmonthnum":199407,
+    "round":1
+}
+{% endhighlight %}
+
+This sample comes from SSB with some additional fields such as `event_time`, which stands for the timestamp of the current event.
+We assume that events come from countries in different timezones; "2019-12-09 08:44:50.000-0500" indicates that the event uses the `America/New_York` timezone. You may have some events which come from `Asia/Shanghai` as well.
+
+`local_day_hour_minute` is a column whose value is in the local timezone, e.g. "GMT+8" in the above sample.
+
+### Question
+When performing real-time OLAP analysis with Kylin, you may have concerns such as:
+
+1. Will events in different timezones cause incorrect query results?
+2. How can I correct the data when Kafka messages contain wrong values, say a misspelled dimension value?
+3. How can I recover very late messages that have been dropped?
+4. My query only hits a small time range; how should I write the filter condition so that unused segments are pruned/skipped during the scan?
+
+### Quick Answer
+For the first question, you can always get the correct result in the timezone of your location by setting `kylin.stream.event.timezone=GMT+N` for all Kylin processes.
By default, UTC is used for *derived time columns*.
+
+For the second and third questions, you cannot update or append segments in a normal streaming cube, but you can do so in a streaming cube that is in lambda mode. All you need to prepare is a Hive table that is mapped to your Kafka events.
+
+For the fourth question, you can achieve this by adding a *derived time column* such as `MINUTE_START`/`DAY_START` to your filter condition.
+
+# How to do it
+
+### Configure timezone
+Messages may come from different timezones, but you want query results in one specific timezone.
+For example, if you live in a GMT+2 location, please set `kylin.stream.event.timezone=GMT+2` for all Kylin processes.
+
+
+### Create lambda table
+
+You should create a Hive table in the *default* namespace. This table should contain all your dimension and measure columns; please
+ remember to include derived time columns such as `MINUTE_START`/`DAY_START` if you use them as dimension columns in your cube.
+
+Depending on the granularity at which you want to update segments, you can choose `HOUR_START` or `DAY_START` as the partition column of this Hive table.
+
+{% highlight Groff markup %}
+use default;
+CREATE EXTERNAL TABLE IF NOT EXISTS lambda_flat_table
+(
+-- event timestamp and debug purpose column
+EVENT_TIME timestamp
+,ROUND bigint COMMENT "For debug purpose, in which round this event was sent by the producer"
+,LOCAL_DAY_HOUR string COMMENT "For debug purpose, maybe check timezone etc"
+,LOCAL_MINUTE string COMMENT "For debug purpose, maybe check timezone etc"
+
+-- dimension column on fact table
+,LO_QUANTITY bigint
+,LO_DISCOUNT bigint
+
+-- dimension column on dimension table
+,C_REGION string
+,C_NATION string
+,C_CITY string
+
+,D_YEAR int
+,D_YEARMONTH string
+,D_WEEKNUMINYEAR int
+,D_YEARMONTHNUM int
+
+,S_REGION string
+,S_NATION string
+,S_CITY string
+
+,P_CATEGORY string
+,P_BRAND string
+,P_MFGR string
+
+
+-- measure column on fact table
+,V_REVENUE bigint
+,LO_SUPPLYCOST bigint
+,LO_REVENUE bigint
+,PROFIT_RATIO double
+
+-- used by Kylin
+,MINUTE_START timestamp
+,HOUR_START timestamp
+,MONTH_START date
+)
+PARTITIONED BY (DAY_START date)
+STORED AS SEQUENCEFILE
+LOCATION 'hdfs:///LacusDir/lambda_flat_table';
+{% endhighlight %}
+
+
+### Create streaming cube in Kylin
+The first step is to add information such as the broker list and topic name;
+after that, paste a sample message on the left and let Kylin auto-detect the column names and column types.
+Some detected data types may be incorrect; please fix them manually and make sure they match the data types in the Hive table.
+
+For example, you should change the data type of event_time from varchar to timestamp.
+Some column names also differ from those in the Hive table, so please correct them too, such as `customer_info_REGION` to `C_REGION`.
+
+![image](/images/RealtimeOlap/Before-Submit.png)
+
+Next, choose the right *TSColumn* and *TSParser* and the correct *Table Name*; the table name should be identical to the name of the Hive table. Then click the *submit* button.
+If everything is correct, the table meta info will be saved successfully; otherwise please correct the data types and column names according to the output message.
+
+When creating the Model, please set *Partition Date Column* to the right value. For a streaming cube, *Partition Date Column* is used to generate the HQL for updating segments whose source data comes from Hive.
+![image](/images/RealtimeOlap/CreateStreamingModel.png)
+
+### Check result with timezone
+
+Let us do a quick check to see whether *LOCAL_MINUTE* is aligned with *HOUR_START*.
+{% highlight Groff markup %}
+SELECT LOCAL_MINUTE, HOUR_START, sum(LO_SUPPLYCOST)
+FROM LAMBDA_FLAT_TABLE
+WHERE day_start = '2019-12-09'
+GROUP BY LOCAL_MINUTE, HOUR_START
+ORDER BY LOCAL_MINUTE, HOUR_START
+{% endhighlight %}
+
+![image](/images/RealtimeOlap/Timezone-checkresult.png)
+
+### Update segment
+
+1. Use an ETL tool such as Spark Streaming to write the corrected data into HDFS, and add new partitions based on your new data files.
+2. After that, use the REST API `http://localhost:7070/kylin/api/cubes/{cube_name}/rebuild` [PUT method] to submit a build job that replaces the old segments;
+please add the offset according to your timezone to `startTime` and `endTime` if you have set `kylin.stream.event.timezone`.
+3. In some cases, you may want to add a lot of historical data to a Kylin streaming cube for analysis (not to replace anything); you can use the same method.
+
+![image](/images/RealtimeOlap/JobMonitor.png)
+![image](/images/RealtimeOlap/LambdaCubeSegment.png)
+
+### Some screenshots
+![image](/images/RealtimeOlap/Table-Meta-1.png)
+![image](/images/RealtimeOlap/Table-Meta-2.png)
+![image](/images/RealtimeOlap/Table-Meta-3.png)
+
diff --git a/website/_docs/tutorial/real_time_olap.md b/website/_docs/tutorial/real_time_olap.md
index e7e1047..588f966 100644
--- a/website/_docs/tutorial/real_time_olap.md
+++ b/website/_docs/tutorial/real_time_olap.md
@@ -15,8 +15,9 @@ In this tutorial, we will use Hortonworks HDP-2.4.0.0.169 Sandbox VM + Kafka v1.
 4.
Start consumption 5. Monitor receiver -The configuration can be found at [Real-time OLAP configuration](http://kylin.apache.org/docs30/install/configuration.html#realtime-olap). +The configuration can be found at [Real-time OLAP configuration](http://kylin.apache.org/docs/install/configuration.html#realtime-olap). The detail can be found at [Deep Dive into Real-time OLAP](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/). +If you want to configure timezone or learn how to use lambda cube, please check this [Lambda Mode and Timezone](/docs/tutorial/lambda_mode_and_timezone_realtime_olap.html) ---- @@ -238,4 +239,4 @@ When the mouse pointer moves over the segment icon, the partition level statisti - Please make sure that the port 7070 and 9090 is not occupied. If you have to change port, please do this set `kylin.stream.node` in `kylin.properties` for receiver or coordinator separately. - If you find you have messed up and want to clean up, please remove streaming metadata in Zookeeper. This can be done by executing `rmr PATH_TO_DELETE` in `zookeeper-client` shell. By default, the root dir of streaming metadata is under `kylin.env.zookeeper-base-path` + `kylin.metadata.url` + `/stream`. -For example, if you set `kylin.env.zookeeper-base-path` to `/kylin`, set `kylin.metadata.url` to `kylin_metadata@hbase`, you should delete path `/kylin/kylin_metadata/stream`. \ No newline at end of file +For example, if you set `kylin.env.zookeeper-base-path` to `/kylin`, set `kylin.metadata.url` to `kylin_metadata@hbase`, you should delete path `/kylin/kylin_metadata/stream`. diff --git a/website/_docs30/tutorial/real_time_olap.md b/website/_docs30/tutorial/real_time_olap.md index cd9de8e..3069b41 100644 --- a/website/_docs30/tutorial/real_time_olap.md +++ b/website/_docs30/tutorial/real_time_olap.md @@ -17,6 +17,7 @@ In this tutorial, we will use Hortonworks HDP-2.4.0.0.169 Sandbox VM + Kafka v1. 
 The configuration can be found at [Real-time OLAP configuration](http://kylin.apache.org/docs30/install/configuration.html#realtime-olap).
 The detail can be found at [Deep Dive into Real-time OLAP](http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/).
+If you want to configure the timezone or learn how to use a lambda cube, please check [Lambda Mode and Timezone](/docs/tutorial/lambda_mode_and_timezone_realtime_olap.html).
 ----
diff --git a/website/download/index.cn.md b/website/download/index.cn.md
index ea58420..5626d96 100644
--- a/website/download/index.cn.md
+++ b/website/download/index.cn.md
@@ -6,7 +6,7 @@ title: 下载
 您可以按照这些[步骤](https://www.apache.org/info/verification.html) 并使用这些[KEYS](https://www.apache.org/dist/kylin/KEYS)来验证下载文件的有效性.
 #### v3.0.0
-- 这是 Kylin 在 2.x 版本后开发的包含实时 OLAP 等功能的新版本。使用该版本,Kylin 支持对流式数据的亚秒级查询。请访问 [实时 OLAP 使用教程](/docs30/tutorial/realtime_olap.html) 和 [实时 OLAP 博客](/blog/2019/04/12/rt-streaming-design/) 获取详情。
+- 这是 Kylin 在 2.x 版本后开发的包含实时 OLAP 等功能的新版本。使用该版本,Kylin 支持对流式数据的亚秒级查询。请访问 [实时 OLAP 使用教程](/docs/tutorial/realtime_olap.html) 和 [实时 OLAP 博客](/blog/2019/04/12/rt-streaming-design/) 获取详情。
 - [发布说明](/docs30/release_notes.html), [安装指南](/docs30/install/index.html) and [升级指南](/docs30/howto/howto_upgrade.html)
 - 源码下载: [apache-kylin-3.0.0-source-release.zip](https://www.apache.org/dyn/closer.cgi/kylin/apache-kylin-3.0.0/apache-kylin-3.0.0-source-release.zip) \[[asc](https://www.apache.org/dist/kylin/apache-kylin-3.0.0/apache-kylin-3.0.0-source-release.zip.asc)\] \[[sha256](https://www.apache.org/dist/kylin/apache-kylin-3.0.0/apache-kylin-3.0.0-source-release.zip.sha256)\]
 - Hadoop 2 二进制包:
diff --git a/website/download/index.md b/website/download/index.md
index 71c1ff0..beeee89 100644
--- a/website/download/index.md
+++ b/website/download/index.md
@@ -7,7 +7,7 @@ permalink: /download/index.html
 You can verify the download by following these [procedures](https://www.apache.org/info/verification.html) and using these [KEYS](https://www.apache.org/dist/kylin/KEYS).
#### v3.0.0 -- This is a release of Kylin's next generation after 2.x, with the new real-time OLAP feature, Kylin can query streaming data with sub-second latency. To learn about real-time OLAP, please visit [the tech blog](/blog/2019/04/12/rt-streaming-design/) and [the tutorial](/docs30/tutorial/realtime_olap.html) for real-time OLAP. +- This is a release of Kylin's next generation after 2.x, with the new real-time OLAP feature, Kylin can query streaming data with sub-second latency. To learn about real-time OLAP, please visit [the tech blog](/blog/2019/04/12/rt-streaming-design/) and [the tutorial](/docs/tutorial/realtime_olap.html) for real-time OLAP. - [Release notes](/docs30/release_notes.html), [installation guide](/docs30/install/index.html) and [upgrade guide](/docs30/howto/howto_upgrade.html) - Source download: [apache-kylin-3.0.0-source-release.zip](https://www.apache.org/dyn/closer.cgi/kylin/apache-kylin-3.0.0/apache-kylin-3.0.0-source-release.zip) \[[asc](https://www.apache.org/dist/kylin/apache-kylin-3.0.0/apache-kylin-3.0.0-source-release.zip.asc)\] \[[sha256](https://www.apache.org/dist/kylin/apache-kylin-3.0.0/apache-kylin-3.0.0-source-release.zip.sha256)\] - Binary for Hadoop 2 download: diff --git a/website/images/RealtimeOlap/Before-Submit.png b/website/images/RealtimeOlap/Before-Submit.png new file mode 100644 index 0000000..679a86a Binary files /dev/null and b/website/images/RealtimeOlap/Before-Submit.png differ diff --git a/website/images/RealtimeOlap/CreateStreamingModel.png b/website/images/RealtimeOlap/CreateStreamingModel.png new file mode 100644 index 0000000..221414a Binary files /dev/null and b/website/images/RealtimeOlap/CreateStreamingModel.png differ diff --git a/website/images/RealtimeOlap/JobMonitor.png b/website/images/RealtimeOlap/JobMonitor.png new file mode 100644 index 0000000..266d128 Binary files /dev/null and b/website/images/RealtimeOlap/JobMonitor.png differ diff --git a/website/images/RealtimeOlap/LambdaCubeSegment.png 
b/website/images/RealtimeOlap/LambdaCubeSegment.png new file mode 100644 index 0000000..600d3eb Binary files /dev/null and b/website/images/RealtimeOlap/LambdaCubeSegment.png differ diff --git a/website/images/RealtimeOlap/Table-Meta-1.png b/website/images/RealtimeOlap/Table-Meta-1.png new file mode 100644 index 0000000..093d303 Binary files /dev/null and b/website/images/RealtimeOlap/Table-Meta-1.png differ diff --git a/website/images/RealtimeOlap/Table-Meta-2.png b/website/images/RealtimeOlap/Table-Meta-2.png new file mode 100644 index 0000000..820d0f2 Binary files /dev/null and b/website/images/RealtimeOlap/Table-Meta-2.png differ diff --git a/website/images/RealtimeOlap/Table-Meta-3.png b/website/images/RealtimeOlap/Table-Meta-3.png new file mode 100644 index 0000000..b3bc019 Binary files /dev/null and b/website/images/RealtimeOlap/Table-Meta-3.png differ diff --git a/website/images/RealtimeOlap/Timezone-checkresult.png b/website/images/RealtimeOlap/Timezone-checkresult.png new file mode 100644 index 0000000..2d32de1 Binary files /dev/null and b/website/images/RealtimeOlap/Timezone-checkresult.png differ
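
The "Update segment" step of the tutorial added in this commit asks for a PUT request to `/kylin/api/cubes/{cube_name}/rebuild` with `startTime` and `endTime` shifted by the timezone offset. The following is a minimal Python sketch of that step, not part of the commit. It rests on several assumptions: the helper names `day_range_millis` and `rebuild_request` are hypothetical; the body fields `startTime`, `endTime` and `buildType` follow Kylin's public REST documentation as recalled here; the `"BUILD"` value and the direction of the shift (added, following the tutorial's wording "add offset") should both be verified against the segment ranges shown in your own Kylin instance.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Tuple
from urllib.request import Request

MILLIS_PER_HOUR = 3600 * 1000


def day_range_millis(day: str, offset_hours: int = 0) -> Tuple[int, int]:
    """Return (startTime, endTime) in epoch milliseconds for one calendar day.

    The day is first taken as midnight-to-midnight UTC. offset_hours is then
    ADDED to both bounds; this direction is an assumption taken from the
    tutorial's "add offset" wording, so check it on your cluster.
    """
    start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    shift = offset_hours * MILLIS_PER_HOUR
    return (int(start.timestamp() * 1000) + shift,
            int(end.timestamp() * 1000) + shift)


def rebuild_request(base_url: str, cube_name: str, day: str,
                    offset_hours: int = 0, build_type: str = "BUILD") -> Request:
    """Build (but do not send) the PUT request for the rebuild endpoint."""
    start_ms, end_ms = day_range_millis(day, offset_hours)
    body = {"startTime": start_ms, "endTime": end_ms, "buildType": build_type}
    return Request(
        url=f"{base_url}/kylin/api/cubes/{cube_name}/rebuild",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # "my_cube" is a placeholder cube name; GMT+8 matches the sample message.
    req = rebuild_request("http://localhost:7070", "my_cube", "2019-12-09", 8)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The request is only constructed here. Sending it, for example with `urllib.request.urlopen(req)`, would additionally need the credentials your Kylin deployment requires, which are left out of this sketch.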