Re:Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

Ma Gang Thu, 01 Nov 2018 00:29:20 -0700

Hi ShaoFeng,
For streaming ingest/query performance, there is a doc: 
https://drive.google.com/file/d/1GSBMpRuVQRmr8Ev2BWvssfMd-Rck9vsH/view?ths=true 
It is also in the 'performance' section of the design doc attached to the JIRA: 
https://issues.apache.org/jira/browse/KYLIN-3654
For stability, it has been very stable in our environment, but it is not yet 
widely used in eBay, so it is hard to say more.
I will start merging the code into the master branch. It may take some time 
because our current version is Kylin 2.1.0. I hope it can be done before 
Nov. 30, but I cannot guarantee it, since there is a lot of other work to do.


At 2018-11-01 15:08:12, "ShaoFeng Shi" <[email protected]> wrote:
>Hi Gang,
>
>Thank you for the information, that is helpful for understanding the
>overall design and implementation.
>
>Do you have some statistical information, like performance, throughput,
>stability, etc.? Besides, what's the plan for contributing it to the
>community? Thanks!
>
>
>Ma Gang <[email protected]> wrote on Thu, Nov 1, 2018 at 2:45 PM:
>
>> Thanks Xiaoxiang,
>> Very good questions! Please see my comments starting with [Gang]:
>>
>>
>> 1.      Is it possible to use Yarn as the cluster manager for index tasks?
>> The coordinator process would set them up at a specified period.
>> [Gang] I think it is possible, but in the current design the indexing
>> task is a long-running task that can also serve queries. This makes the
>> whole system very simple and efficient, so I don't think we need to
>> stop/start indexing tasks from time to time. Using Yarn to manage
>> resources is possible, but we would need to redesign the existing
>> coordinator to make it easy to deploy to Yarn, Kubernetes, etc. I hope
>> this can be done after the contribution to the community.
>>
>> 2.      As I know, eBay's New Kylin Streaming Solution uses a replica set
>> to ensure that incoming messages won't be lost if some processes are lost.
>> I think a replica set is a set of Kafka consumer processes responsible for
>> ingesting messages and building the base cuboid in memory. Could you
>> please show me some details about how the replica set provides the HA
>> guarantee? How do I configure it? A link / paper is OK. I found one, but I
>> don't know if it has the same meaning as your replica set.
>>
>>
>> [Gang] Yes, it is similar to MongoDB replication, but currently we don't
>> replicate data from a primary node. We just assign the same Kafka
>> topic/partitions to all receivers in a ReplicaSet, so every receiver in a
>> ReplicaSet consumes the same data from Kafka. If one receiver is down, the
>> other receivers in the ReplicaSet are still consuming the same Kafka data,
>> so consumption and queries are not impacted. We don't guarantee that the
>> receivers in a ReplicaSet have the same consuming rate, but we can
>> guarantee that the user sees data consistently by sticking the queries for
>> one cube to one receiver.
>> The HA implementation is a little naive, but it is simple and it works.
>> Maybe in the future we can do HA by replication, to support other
>> streaming sources that don't support multiple consumers and don't have a
>> persistent store.
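
A minimal sketch of the sticky routing described in the answer above, for illustration only. The thread does not say how a receiver is chosen for a cube; this sketch assumes rendezvous (highest-random-weight) hashing as one possible way to do it. The `ReplicaSet` class, its methods, and the receiver and cube names are hypothetical and are not Kylin APIs.

```python
import hashlib


class ReplicaSet:
    """Hypothetical model: every receiver consumes the same Kafka partitions."""

    def __init__(self, receivers):
        self.receivers = list(receivers)
        self.alive = set(self.receivers)

    def mark_down(self, receiver):
        # Called when a health check fails (assumed mechanism).
        self.alive.discard(receiver)

    def mark_up(self, receiver):
        if receiver in self.receivers:
            self.alive.add(receiver)

    def pick_receiver(self, cube):
        """Return the receiver that should serve all queries for `cube`.

        Rendezvous hashing: each live receiver gets a score derived from
        (cube, receiver) and the highest score wins. The choice is stable
        for a cube and changes only if the chosen receiver goes down.
        """
        candidates = [r for r in self.receivers if r in self.alive]
        if not candidates:
            raise RuntimeError("no live receiver in this replica set")
        return max(
            candidates,
            key=lambda r: hashlib.md5(f"{cube}:{r}".encode()).hexdigest(),
        )


rs = ReplicaSet(["receiver-1", "receiver-2", "receiver-3"])
chosen = rs.pick_receiver("sales_cube")
rs.mark_down(chosen)
fallback = rs.pick_receiver("sales_cube")  # another live receiver takes over
```

Because all receivers read the same partitions, the fallback receiver already holds the data (possibly at a slightly different offset), which is why no copy from a primary is needed in this design.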
>>
>> 3.      How to add or remove a node of a replica set in a production env?
>> How to monitor the health/pressure of the replica set cluster?
>> [Gang] Currently we have a UI and a RESTful API that let an admin
>> add/remove a node to/from a ReplicaSet, and a simple UI that lets the
>> admin monitor the health and consuming rate of each receiver/cube. All
>> metrics are collected using the Yammer metrics framework, so it is easy
>> to expose them to other monitoring systems.
>>
>> 4.      Are all measures supported in eBay's New Kylin Streaming
>> Solution? What about count distinct (bitmap)?
>> [Gang] Most measures are supported, but precise count distinct (bitmap)
>> is not supported when the distinct dimension is not of int type. As you
>> know, supporting precise count distinct for a non-int dimension requires
>> building a global dictionary, which is not possible in the streaming env.
>>
>>
>> 5.      It seems eBay's New Kylin Streaming Solution uses a custom
>> columnar storage. Why not use a mature open source columnar storage
>> solution? Have you ever compared the performance of your custom columnar
>> storage to an open source columnar storage solution?
>>
>> [Gang] Most open source columnar formats, like Parquet and ORC, are
>> designed for use in a Hadoop env, while the streaming data is on local
>> disk, so I didn't consider them at the beginning. It is not very hard to
>> define a columnar format to store Kylin-specific data. With a customized
>> columnar storage, you can use mmap'ed files to scan data and add a
>> row-level inverted index for all dimensions, so I think the performance
>> will be better compared to using a common columnar format. I didn't
>> compare the performance, but the storage engine is pluggable, so you may
>> contribute a Parquet storage if you are interested.
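
A minimal sketch of what a row-level inverted index over dimensions could look like, for illustration only. It is not the actual Kylin storage format: the function names, the in-memory dict layout, and the sample rows are assumptions, and the mmap-based file scan mentioned above is not shown.

```python
from collections import defaultdict


def build_inverted_index(rows, dimensions):
    """Map each dimension value to the list of row ids that contain it."""
    index = {dim: defaultdict(list) for dim in dimensions}
    for row_id, row in enumerate(rows):
        for dim in dimensions:
            index[dim][row[dim]].append(row_id)
    return index


def filter_rows(index, predicates):
    """Return sorted row ids matching every `dimension == value` predicate.

    Each predicate is answered from the index and the row-id sets are
    intersected, so no row has to be scanned to evaluate the filter.
    An empty `predicates` dict returns an empty list in this sketch.
    """
    result = None
    for dim, value in predicates.items():
        ids = set(index[dim].get(value, ()))
        result = ids if result is None else result & ids
    return sorted(result) if result is not None else []


# Hypothetical sample data with two dimensions.
rows = [
    {"site": "US", "category": "toys"},
    {"site": "DE", "category": "toys"},
    {"site": "US", "category": "books"},
]
index = build_inverted_index(rows, ["site", "category"])
matching = filter_rows(index, {"site": "US", "category": "toys"})
```

The matching row ids can then be used to read only the needed positions from each column, which is where a fixed-width, memory-mapped column file would help.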
>>
>>
>>
>>
>>
>>
>> At 2018-11-01 12:42:25, "Xiaoxiang Yu" <[email protected]> wrote:
>> >Hi Gang, I am so glad to know that eBay has a solution for real-time
>> >OLAP on Kylin. I have some small questions:
>> >
>> >
>> >1.      Is it possible to use Yarn as the cluster manager for index tasks?
>> >The coordinator process would set them up at a specified period. Yarn
>> >would manage:
>> >
>> >a)       retrying these tasks if some fail
>> >
>> >b)       resource allocation
>> >
>> >c)       log collection
>> >
>> >2.      As I know, eBay's New Kylin Streaming Solution uses a replica set
>> >to ensure that incoming messages won't be lost if some processes are
>> >lost. I think a replica set is a set of Kafka consumer processes
>> >responsible for ingesting messages and building the base cuboid in
>> >memory. Could you please show me some details about how the replica set
>> >provides the HA guarantee? How do I configure it? A link / paper is OK.
>> >I found one, but I don't know if it has the same meaning as your replica
>> >set.
>> >
>> >a)       [MongoDB replication](https://docs.mongodb.com/manual/replication/).
>> >
>> >3.      How to add or remove a node of a replica set in a production env?
>> >How to monitor the health/pressure of the replica set cluster?
>> >
>> >4.      Are all measures supported in eBay's New Kylin Streaming
>> >Solution? What about count distinct (bitmap)?
>> >
>> >5.      It seems eBay's New Kylin Streaming Solution uses a custom
>> >columnar storage. Why not use a mature open source columnar storage
>> >solution? Have you ever compared the performance of your custom columnar
>> >storage to an open source columnar storage solution?
>> >
>> >
>> >
>> >----------------
>> >Best wishes,
>> >Xiaoxiang Yu
>> >
>> >
>> >From: Ma Gang <[email protected]>
>> >Reply-To: "[email protected]" <[email protected]>
>> >Date: Tuesday, October 30, 2018 15:24
>> >To: "[email protected]" <[email protected]>
>> >Subject: [DISCUSS] New Kylin Streaming Solution From eBay
>> >
>> >Hi all,
>> >
>> >The eBay Kylin team has developed a new Kylin streaming solution. The
>> >basic idea is to build a streaming cluster that ingests data from a
>> >streaming source (Kafka) and serves queries on real-time data. The data
>> >preparation latency is milliseconds, which means the data is queryable
>> >almost as soon as it is ingested. Attached is the architecture design
>> >doc.
>> >We would like to contribute the feature to the community; please let us
>> >know if you have any concerns.
>> >
>> >Thanks,
>> >Gang(Allen) Ma
>> >
>> >
>> >
>> >
>> >
>>
>
>
>-- 
>Best regards,
>
>Shaofeng Shi 史少锋
