Hi Gang,

I appreciate your hard work!

Ma Gang <mg4w...@163.com> wrote on Thu, Nov 1, 2018 at 3:29 PM:

> Hi ShaoFeng,
> For streaming ingest/query performance, there is a doc:
> https://drive.google.com/file/d/1GSBMpRuVQRmr8Ev2BWvssfMd-Rck9vsH/view?ths=true
> It is also in the 'performance' section of the design doc attached to the
> JIRA: https://issues.apache.org/jira/browse/KYLIN-3654
> For stability, it is very stable in our environment, but it is not yet
> widely used in eBay, so it is hard to say.
> I will start to merge the code to the master branch. It may take some time
> because our current version is Kylin 2.1.0. I hope it can be done before
> Nov. 30, but I cannot guarantee it; there is a lot of other work to do.
>
> At 2018-11-01 15:08:12, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
> >Hi Gang,
> >
> >Thank you for the information, that is helpful for understanding the
> >overall design and implementation.
> >
> >Do you have some statistical information, like performance, throughput,
> >stability, etc.? Besides, what's the plan for contributing it to the
> >community? Thanks!
> >
> >
> >Ma Gang <mg4w...@163.com> wrote on Thu, Nov 1, 2018 at 2:45 PM:
> >
> >> Thanks Xiaoxiang,
> >> Very good questions! Please see my comments starting with [Gang]:
> >>
> >>
> >> 1. Is it possible to use Yarn as cluster manager for index tasks?
> >> The coordinator process will set them up at a specified period.
> >> [Gang] I think it is possible, but in the current design the indexing
> >> task is a long-running task that also provides query service. This
> >> makes the whole system very simple and efficient, so I don't think we
> >> need to stop/start the indexing task from time to time. Using Yarn to
> >> manage the resources is possible, but we would need to redesign the
> >> existing coordinator to make it easy to deploy to Yarn, Kubernetes,
> >> etc. I hope this can be done after the contribution to the community.
> >>
> >> 2. As I know, eBay's New Kylin Streaming Solution uses a replica set
> >> to ensure that incoming messages are not lost if some processes are
> >> lost.
> >> I think a replica set is a set of Kafka consumer processes which are
> >> responsible for ingesting messages and building the base cuboid in
> >> memory. Could you please show me some detail about how the replica set
> >> provides the HA guarantee? How do I configure it? A link / paper is OK.
> >> I found one but I don't know if it has the same meaning as your
> >> replica set.
> >>
> >>
> >> [Gang] Yes, it is similar to MongoDB replication, but currently we
> >> don't replicate data from a primary node. We just assign the same Kafka
> >> topic/partitions to the receivers in a ReplicaSet, and all receivers in
> >> a ReplicaSet consume data from Kafka. If one receiver is down, the
> >> other receivers in the ReplicaSet are still consuming the same Kafka
> >> data, so consuming/querying will not be impacted. We don't guarantee
> >> that the receivers in a ReplicaSet have the same consuming rate, but we
> >> can guarantee that the user views data consistently by sticking the
> >> queries for one cube to one receiver.
> >> The HA implementation is a little bit naive, but simple, and it works.
> >> Maybe in the future we can do HA by replication, to support other
> >> streaming sources that don't support multiple consumers and don't have
> >> a persistent store.
> >>
> >> 3. How to add or remove a node of a replica set in a production env?
> >> How to monitor the health/pressure of the replica set cluster?
> >> [Gang] Currently we have a UI/RESTful API to let the admin add/remove
> >> nodes to/from a ReplicaSet, and a simple UI to let the admin monitor
> >> the health and consuming rate of each receiver/cube. Also, all metrics
> >> are collected using the Yammer metrics framework, so they are easy to
> >> expose to other monitoring systems.
> >>
> >> 4. Are all measures supported in eBay's New Kylin Streaming
> >> Solution? What about count distinct (bitmap)?
> >> [Gang] Most measures are supported, but precise count distinct (bitmap)
> >> is not supported when the distinct dimension is not of int type.
> >> As you know, supporting precise count distinct for a non-int dimension
> >> requires building a global dictionary, which is not possible in the
> >> streaming env.
> >>
> >>
> >> 5. It seems eBay's New Kylin Streaming Solution uses a custom columnar
> >> storage. Why not use a mature open source columnar storage solution?
> >> Have you ever compared the performance of your custom columnar storage
> >> to an open source columnar storage solution?
> >>
> >> [Gang] Most open source columnar formats like Parquet and ORC are
> >> designed for use in a Hadoop env, while the streaming data are on local
> >> disk, so I didn't consider them at the beginning. It is not very hard
> >> to define a columnar format to store Kylin-specific data. With a
> >> customized columnar storage, you can use mmap files to scan data and
> >> add a row-level inverted index for all dimensions, so I think the
> >> performance will be better than with a common columnar format. I
> >> didn't compare the performance, but the storage engine is pluggable;
> >> you may contribute a Parquet storage if you are interested.
> >>
> >>
> >> At 2018-11-01 12:42:25, "Xiaoxiang Yu" <xiaoxiang...@kyligence.io> wrote:
> >> >Hi Gang, I am so glad to know that eBay has a solution for real-time
> >> >OLAP on Kylin. I have some small questions:
> >> >
> >> >1. Is it possible to use Yarn as cluster manager for index tasks?
> >> >The coordinator process will set them up at a specified period. Yarn
> >> >will manage:
> >> >
> >> >a) retrying these tasks if some fail
> >> >
> >> >b) resource allocation
> >> >
> >> >c) log collection
> >> >
> >> >2. As I know, eBay's New Kylin Streaming Solution uses a replica set
> >> >to ensure that incoming messages are not lost if some processes are
> >> >lost. I think a replica set is a set of Kafka consumer processes which
> >> >are responsible for ingesting messages and building the base cuboid in
> >> >memory.
> >> >Could you please show me some detail about how the replica set
> >> >provides the HA guarantee? How do I configure it? A link / paper is OK.
> >> >I found one but I don't know if it has the same meaning as your
> >> >replica set.
> >> >
> >> >a) [MongoDB replication](https://docs.mongodb.com/manual/replication/).
> >> >
> >> >3. How to add or remove a node of a replica set in a production env?
> >> >How to monitor the health/pressure of the replica set cluster?
> >> >
> >> >4. Are all measures supported in eBay's New Kylin Streaming
> >> >Solution? What about count distinct (bitmap)?
> >> >
> >> >5. It seems eBay's New Kylin Streaming Solution uses a custom columnar
> >> >storage. Why not use a mature open source columnar storage solution?
> >> >Have you ever compared the performance of your custom columnar storage
> >> >to an open source columnar storage solution?
> >> >
> >> >
> >> >----------------
> >> >Best wishes,
> >> >Xiaoxiang Yu
> >> >
> >> >
> >> >From: Ma Gang <mg4w...@163.com>
> >> >Reply-To: "dev@kylin.apache.org" <dev@kylin.apache.org>
> >> >Date: Tuesday, October 30, 2018 15:24
> >> >To: "dev@kylin.apache.org" <dev@kylin.apache.org>
> >> >Subject: [DISCUSS] New Kylin Streaming Solution From eBay
> >> >
> >> >Hi all,
> >> >
> >> >The eBay Kylin team has developed a new Kylin streaming solution. The
> >> >basic idea is to build a streaming cluster to ingest data from a
> >> >streaming source (Kafka) and provide queries on real-time data. The
> >> >data preparation latency is milliseconds, which means the data is
> >> >queryable almost as soon as it is ingested. Attached is the
> >> >architecture design doc.
> >> >We would like to contribute the feature to the community; please let us
> >> >know if you have any concerns.
> >> >
> >> >Thanks,
> >> >Gang(Allen) Ma
> >> >
> >
> >--
> >Best regards,
> >
> >Shaofeng Shi 史少锋
>

--
Best regards,

Shaofeng Shi 史少锋
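
Note on question 2: the ReplicaSet behaviour Gang describes in the quoted thread (every receiver in a ReplicaSet consumes the same Kafka topic/partitions, and queries for one cube stick to one receiver) can be illustrated with a minimal sketch. This is not Kylin code. `Receiver`, `ReplicaSet`, `route`, and the CRC32-based first pick are hypothetical names and choices made only for illustration, and Kafka consumption itself is not modelled.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical stand-in for one streaming receiver process."""
    name: str
    alive: bool = True


@dataclass
class ReplicaSet:
    """Receivers that are all assigned the same Kafka topic/partitions."""
    receivers: list
    # cube name -> name of the receiver currently answering its queries
    sticky: dict = field(default_factory=dict)

    def route(self, cube: str) -> Receiver:
        """Return the receiver that should answer queries for `cube`."""
        by_name = {r.name: r for r in self.receivers}
        current = by_name.get(self.sticky.get(cube, ""))
        if current is not None and current.alive:
            # Keep the cube pinned so the user sees a consistent view,
            # even if receivers consume at different rates.
            return current
        live = [r for r in self.receivers if r.alive]
        if not live:
            raise RuntimeError("no live receiver in this replica set")
        # Assumption: a deterministic hash picks the receiver to pin to.
        chosen = live[zlib.crc32(cube.encode("utf-8")) % len(live)]
        self.sticky[cube] = chosen.name
        return chosen


rs = ReplicaSet([Receiver("receiver-1"), Receiver("receiver-2")])
first = rs.route("sales_cube")
first.alive = False                 # simulate a receiver going down
second = rs.route("sales_cube")     # queries fail over to a live receiver
```

After the failover the cube stays pinned to the new receiver, which mirrors the "stick the query to one receiver for one cube" rule. Because receivers are not guaranteed to consume at the same rate, a real failover could still expose a small difference in how much data is visible.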