Hi ShaoFeng,

For streaming ingest/query performance, there is a doc:
https://drive.google.com/file/d/1GSBMpRuVQRmr8Ev2BWvssfMd-Rck9vsH/view?ths=true
It is also in the 'performance' section of the design doc attached to the JIRA:
https://issues.apache.org/jira/browse/KYLIN-3654

For stability, it has been very stable in our environment, but it is not yet widely used at eBay, so it is hard to say more.

I will start merging the code to the master branch. It may take some time because our current version is Kylin 2.1.0. I hope it can be done before Nov. 30, but I cannot guarantee it, as there is a lot of other work to do.

At 2018-11-01 15:08:12, "ShaoFeng Shi" <[email protected]> wrote:
>Hi Gang,
>
>Thank you for the information, that is helpful for understanding the
>overall design and implementation.
>
>Do you have some statistical information, like performance, throughput,
>stability, etc.? Besides, what's the plan for contributing it to the
>community? Thanks!
>
>
>Ma Gang <[email protected]> wrote on Thursday, November 1, 2018 at 2:45 PM:
>
>> Thanks Xiaoxiang,
>> Very good questions! Please see my comments starting with [Gang]:
>>
>>
>> 1. Is it possible to use Yarn as the cluster manager for index tasks?
>> The coordinator process would set them up at a specified period.
>> [Gang] I think it is possible, but in the current design the indexing
>> task is a long-running task that also provides the query service. This
>> makes the whole system very simple and efficient, so I don't think we
>> need to stop/start the indexing task from time to time. Using Yarn to
>> manage the resources is possible, but we would need to redesign the
>> existing coordinator to make it easy to deploy to Yarn, Kubernetes, etc.
>> I hope this can be done after the contribution to the community.
>>
>> 2. As I know, eBay’s New Kylin Streaming Solution uses a replica set to
>> ensure that incoming messages are not lost if some processes are lost. I
>> think a replica set is a set of Kafka consumer processes that are
>> responsible for ingesting messages and building the base cuboid in
>> memory. Could you please show me some detail about how the replica set
>> provides the HA guarantee? How to configure it? A link / paper is OK. I
>> found one but I don’t know if it has the same meaning as your replica
>> set.
>>
>> [Gang] Yes, it is similar to MongoDB replication, but currently we
>> don't replicate data from a primary node. We just assign the same Kafka
>> topic/partitions to all receivers in a ReplicaSet, so every receiver in
>> the ReplicaSet consumes the same data from Kafka. If one receiver goes
>> down, the other receivers in the ReplicaSet are still consuming the same
>> Kafka data, so consuming and querying are not impacted. We don't
>> guarantee that the receivers in a ReplicaSet have the same consuming
>> rate, but we can guarantee that the user sees the data consistently by
>> sticking the queries for one cube to one receiver.
>> The HA implementation is a little bit naive, but it is simple and it
>> works. Maybe in the future we can do HA by replication, to support other
>> streaming sources that don't support multiple consumers and don't have a
>> persistent store.
>>
>> 3. How to add or remove a node of a replica set in a production env?
>> How to monitor the health/pressure of the replica set cluster?
>> [Gang] Currently we have a UI/RESTful API that lets an admin add/remove
>> nodes to/from a ReplicaSet, and a simple UI that lets an admin monitor
>> the health and consuming rate of each receiver/cube. Also, all metrics
>> are collected using the Yammer metrics framework, so they are easy to
>> expose to other monitoring systems.
>>
>> 4. Are all measures supported in eBay’s New Kylin Streaming Solution?
>> What about count distinct (bitmap)?
>> [Gang] Most measures are supported, but precise count distinct (bitmap)
>> is not supported when the distinct dimension is not of int type. As you
>> know, supporting precise count distinct on a non-int dimension requires
>> building a global dictionary, which is not possible in the streaming env.
>>
>>
>> 5. It seems eBay’s New Kylin Streaming Solution uses a custom columnar
>> storage. Why not use a mature open source columnar storage solution?
>> Have you ever compared the performance of your custom columnar storage
>> to an open source columnar storage solution?
>>
>> [Gang] Most open source columnar formats, like Parquet and ORC, are
>> designed for use in the Hadoop env, while the streaming data is on local
>> disk, so I didn't consider them at the beginning. It is not very hard to
>> define a columnar format to store Kylin-specific data. With a customized
>> columnar storage, you can use mmap files to scan data and add a
>> row-level inverted index for all dimensions, so I think the performance
>> will be better compared to using a common columnar format. I didn't
>> compare the performance, but the storage engine is pluggable, so you may
>> contribute a Parquet storage if you are interested.
>>
>>
>> At 2018-11-01 12:42:25, "Xiaoxiang Yu" <[email protected]> wrote:
>> >Hi Gang, I am so glad to know that eBay has a solution for real-time
>> >OLAP on Kylin. I have some small questions:
>> >
>> >
>> >1. Is it possible to use Yarn as the cluster manager for index tasks?
>> >The coordinator process would set them up at a specified period. Yarn
>> >would manage:
>> >
>> >a) retrying these tasks if some fail
>> >
>> >b) resource allocation
>> >
>> >c) log collection
>> >
>> >2. As I know, eBay’s New Kylin Streaming Solution uses a replica set to
>> >ensure that incoming messages are not lost if some processes are lost.
>> >I think a replica set is a set of Kafka consumer processes that are
>> >responsible for ingesting messages and building the base cuboid in
>> >memory. Could you please show me some detail about how the replica set
>> >provides the HA guarantee? How to configure it? A link / paper is OK. I
>> >found one but I don’t know if it has the same meaning as your replica
>> >set.
>> >
>> >a) [Mongodb replication](https://docs.mongodb.com/manual/replication/).
>> >
>> >3. How to add or remove a node of a replica set in a production env?
>> >How to monitor the health/pressure of the replica set cluster?
>> >
>> >4. Are all measures supported in eBay’s New Kylin Streaming Solution?
>> >What about count distinct (bitmap)?
>> >
>> >5. It seems eBay’s New Kylin Streaming Solution uses a custom columnar
>> >storage. Why not use a mature open source columnar storage solution?
>> >Have you ever compared the performance of your custom columnar storage
>> >to an open source columnar storage solution?
>> >
>> >
>> >----------------
>> >Best wishes,
>> >Xiaoxiang Yu
>> >
>> >
>> >From: Ma Gang <[email protected]>
>> >Reply-To: "[email protected]" <[email protected]>
>> >Date: Tuesday, October 30, 2018, 15:24
>> >To: "[email protected]" <[email protected]>
>> >Subject: [DISCUSS] New Kylin Streaming Solution From eBay
>> >
>> >Hi all,
>> >
>> >The eBay Kylin team has developed a new Kylin streaming solution. The
>> >basic idea is to build a streaming cluster that ingests data from a
>> >streaming source (Kafka) and serves queries on real-time data. The data
>> >preparation latency is milliseconds, which means the data is queryable
>> >almost as soon as it is ingested. Attached is the architecture design
>> >doc.
>> >We would like to contribute the feature to the community. Please let us
>> >know if you have any concerns.
>> >
>> >Thanks,
>> >Gang(Allen) Ma
>> >
>
>--
>Best regards,
>
>Shaofeng Shi 史少锋
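
The ReplicaSet scheme Gang describes in his answer to question 2 (every receiver in a ReplicaSet is assigned the same Kafka topic/partitions, and queries for one cube stick to one receiver) can be sketched roughly as follows. This is only an illustration and not the actual Kylin code: the `Receiver` and `ReplicaSet` classes, the `alive` flag, the CRC32-based choice of receiver, and the fall-through to the next live receiver are all assumptions made for the sketch.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical streaming receiver; `alive` stands in for a health check."""
    name: str
    alive: bool = True
    partitions: list = field(default_factory=list)


@dataclass
class ReplicaSet:
    """Hypothetical ReplicaSet: every receiver gets the same partition list."""
    receivers: list

    def assign(self, topic, partitions):
        # No primary-to-secondary replication: each receiver independently
        # consumes the same topic/partitions from Kafka.
        for receiver in self.receivers:
            receiver.partitions = [(topic, p) for p in partitions]

    def route_query(self, cube):
        # Sticky routing (assumed mechanism): a stable hash of the cube name
        # picks a preferred receiver, so repeated queries for one cube hit
        # the same receiver even if consuming rates differ across receivers.
        start = zlib.crc32(cube.encode("utf-8")) % len(self.receivers)
        # Assumed failover: try the following receivers if the preferred
        # one is down.
        for offset in range(len(self.receivers)):
            candidate = self.receivers[(start + offset) % len(self.receivers)]
            if candidate.alive:
                return candidate
        raise RuntimeError("no live receiver in ReplicaSet")


if __name__ == "__main__":
    rs = ReplicaSet([Receiver("receiver-1"), Receiver("receiver-2")])
    rs.assign("events", [0, 1, 2])
    preferred = rs.route_query("sales_cube")
    print("queries for sales_cube go to", preferred.name)
    preferred.alive = False  # simulate a receiver going down
    print("after failure they go to", rs.route_query("sales_cube").name)
```

Because both receivers hold the same assignment, losing one does not interrupt consumption, which matches the "naive but simple" HA described in the thread. The surviving receiver may be at a different consuming position, which is why sticking to one receiver per cube matters while that receiver is up.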
