Re: Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

ShaoFeng Shi Fri, 02 Nov 2018 04:17:52 -0700

Hi Gang, I appreciate your hard work!

Ma Gang <[email protected]> wrote on Thu, Nov 1, 2018 at 3:29 PM:


> Hi ShaoFeng,
> For streaming ingest/query performance, there is a doc:
> https://drive.google.com/file/d/1GSBMpRuVQRmr8Ev2BWvssfMd-Rck9vsH/view?ths=true
> It is also in the 'performance' section of the design doc attached to the
> jira: https://issues.apache.org/jira/browse/KYLIN-3654
> For stability, it is very stable in our environment, but it is not yet
> widely used in eBay, so it is hard to say.
> I will start to merge the code to the master branch. It may take some time
> because our current version is Kylin 2.1.0. I hope it can be done before
> Nov. 30, but I cannot guarantee it; there is a lot of other work to do.
>
> At 2018-11-01 15:08:12, "ShaoFeng Shi" <[email protected]> wrote:
> >Hi Gang,
> >
> >Thank you for the information, that is helpful for understanding the
> >overall design and implementation.
> >
> >Do you have some statistical information, like performance, throughput,
> >stability, etc.? Besides, what's the plan of contributing it to the
> >community? Thanks!
> >
> >
> >Ma Gang <[email protected]> wrote on Thu, Nov 1, 2018 at 2:45 PM:
> >
> >> Thanks Xiaoxiang,
> >> Very good questions! Please see my comments starting with [Gang]:
> >>
> >>
> >> 1.      Is it possible to use Yarn as the cluster manager for index tasks?
> >> The coordinator process would set them up at a specified period.
> >> [Gang] I think it is possible, but in the current design the indexing
> >> task is a long-running task that also provides query service. This makes
> >> the whole system very simple and efficient, so I don't think we need to
> >> stop and start indexing tasks from time to time. Using Yarn to manage
> >> resources is possible, but we would need to redesign the existing
> >> coordinator to make it easy to deploy to Yarn, Kubernetes, etc. I hope
> >> this can be done after the contribution to the community.
> >>
> >> 2.      As I know, eBay’s New Kylin Streaming Solution uses a replica set
> >> to ensure that incoming messages won’t be lost if some processes are
> >> lost. I think a replica set is a set of Kafka consumer processes that
> >> are responsible for ingesting messages and building the base cuboid in
> >> memory. Could you please show me some detail about how the replica set
> >> provides the HA guarantee? How is it configured? A link / paper is OK.
> >> I found one, but I don’t know if it has the same meaning as your
> >> replica set.
> >>
> >>
> >> [Gang] Yes, it is similar to MongoDB replication, but currently we don't
> >> replicate data from a primary node. We just assign the same Kafka
> >> topic/partitions to all receivers in a ReplicaSet, so every receiver in
> >> the ReplicaSet consumes the same data from Kafka. If one receiver is
> >> down, the other receivers in the ReplicaSet are still consuming the same
> >> Kafka data, so consuming and querying are not impacted. We don't
> >> guarantee that the receivers in a ReplicaSet have the same consuming
> >> rate, but we can guarantee that the user sees consistent data by
> >> sticking the queries for one cube to one receiver.
> >> The HA implementation is a little bit naive, but it is simple and it
> >> works. Maybe in the future we can do HA by replication, to support other
> >> streaming sources that don't support multiple consumers and don't have
> >> a persistent store.
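
The ReplicaSet behaviour described in the answer above can be sketched roughly as follows. This is a minimal Python illustration, not eBay's implementation: the `Receiver` and `ReplicaSet` classes, their fields, and the "first live receiver" failover rule are all assumptions made for the example. It only shows the two ideas from the text, that every receiver gets the same partitions, and that queries for a cube stick to one receiver.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical stand-in for one streaming receiver process."""
    name: str
    alive: bool = True
    consumed_offset: int = 0  # receivers may consume at different rates


@dataclass
class ReplicaSet:
    """Hypothetical replica set: all receivers share the same partitions."""
    partitions: tuple
    receivers: list
    _sticky: dict = field(default_factory=dict)  # cube name -> receiver name

    def assignment(self):
        # Every receiver consumes the same partitions independently;
        # nothing is replicated from a primary node.
        return {r.name: self.partitions for r in self.receivers}

    def route_query(self, cube):
        # Reuse the receiver already chosen for this cube while it is alive,
        # so a user does not flip between receivers at different offsets.
        by_name = {r.name: r for r in self.receivers}
        chosen = by_name.get(self._sticky.get(cube))
        if chosen is None or not chosen.alive:
            live = [r for r in self.receivers if r.alive]
            if not live:
                raise RuntimeError("no live receiver in replica set")
            chosen = live[0]  # assumed failover rule: first live receiver
            self._sticky[cube] = chosen.name
        return chosen


rs = ReplicaSet(partitions=("events-0", "events-1"),
                receivers=[Receiver("r1"), Receiver("r2")])
first = rs.route_query("sales_cube")
rs.receivers[0].alive = False  # simulate a receiver going down
after_failure = rs.route_query("sales_cube")
```

In this sketch the query for `sales_cube` stays on `r1` until it goes down and then moves to `r2`, which has been consuming the same partitions all along.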
> >>
> >> 3.      How to add or remove a node of a replica set in the production
> >> env? How to monitor the health/pressure of the replica set cluster?
> >> [Gang] Currently we have a UI/RESTful API that lets an admin add a node
> >> to or remove a node from a ReplicaSet, and a simple UI that lets an admin
> >> monitor the health and consuming rate of each receiver/cube. Also, all
> >> metrics are collected using the Yammer metrics framework, so they are
> >> easy to expose to other monitoring systems.
> >>
> >> 4.      Are all measures supported in eBay’s New Kylin Streaming
> >> Solution? What about count distinct (bitmap)?
> >> [Gang] Most measures are supported, but precise count distinct (bitmap)
> >> is not supported when the distinct dimension is not of int type. As you
> >> know, supporting precise count distinct for a non-int dimension requires
> >> building a global dictionary, which is not possible in the streaming
> >> env.
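
A toy illustration of the point about global dictionaries, written in Python under stated assumptions: a plain Python int stands in for a real bitmap structure, and `bitmap_of`, `cardinality`, and `encode` are hypothetical helpers, not Kylin APIs. It shows that int values can be set in a bitmap directly, while string values only merge correctly when every segment encodes them with the same shared dictionary.

```python
def bitmap_of(ids):
    """Pack non-negative integer IDs into a Python int used as a bitset."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits


def cardinality(bits):
    """Count the set bits, i.e. the number of distinct IDs."""
    return bin(bits).count("1")


def encode(values, dictionary):
    """Give each unseen value the next free ID, then return the IDs."""
    return [dictionary.setdefault(v, len(dictionary)) for v in values]


# Int dimension: the values themselves are the bitmap positions.
int_count = cardinality(bitmap_of([7, 9]) | bitmap_of([11, 7]))

# String dimension, two segments with three distinct users in total.
seg_a = ["alice", "bob"]
seg_b = ["carol", "alice"]

# Separate per-segment dictionaries assign clashing IDs, so the merge is wrong.
local_count = cardinality(
    bitmap_of(encode(seg_a, {})) | bitmap_of(encode(seg_b, {}))
)

# One dictionary shared by all segments keeps IDs consistent.
shared = {}
global_count = cardinality(
    bitmap_of(encode(seg_a, shared)) | bitmap_of(encode(seg_b, shared))
)
```

With per-segment dictionaries the merged bitmap undercounts, because "alice" and "carol" collide on the same ID; the shared dictionary gives the correct count.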
> >>
> >>
> >> 5.      It seems eBay’s New Kylin Streaming Solution uses a custom
> >> columnar storage. Why not use a mature open source columnar storage
> >> solution? Have you ever compared the performance of your custom columnar
> >> storage to an open source columnar storage solution?
> >>
> >> [Gang] Most open source columnar formats, like Parquet and ORC, are
> >> designed for use in a Hadoop env, while the streaming data is on local
> >> disk, so I didn't consider them at the beginning. It is not very hard to
> >> define a columnar format to store Kylin-specific data. With a customized
> >> columnar storage you can use mmap files to scan data and add row-level
> >> inverted indexes for all dimensions, so I think the performance will be
> >> better than with a common columnar format. I didn't compare the
> >> performance, but the storage engine is pluggable, so you may contribute
> >> a Parquet storage if you are interested.
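
The combination mentioned above, an mmap-backed column file plus a row-level inverted index on a dimension, can be sketched in a few lines of Python. The file layout (one little-endian int32 per row), the file name, and the helper functions are assumptions invented for this example; the actual on-disk format is described only in the design doc.

```python
import mmap
import os
import struct
import tempfile
from collections import defaultdict

ROW = struct.Struct("<i")  # assumed layout: one little-endian int32 per row


def write_column(path, values):
    """Write a measure column as fixed-width integers."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ROW.pack(v))


def build_inverted_index(dim_values):
    """Map each dimension value to the row numbers that contain it."""
    index = defaultdict(list)
    for row, value in enumerate(dim_values):
        index[value].append(row)
    return index


def sum_measure(path, rows):
    """Read only the requested rows of the measure column through mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(ROW.unpack_from(mm, r * ROW.size)[0] for r in rows)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "price.col")
    write_column(path, [10, 20, 30, 40])
    index = build_inverted_index(["US", "DE", "US", "CN"])
    us_total = sum_measure(path, index["US"])  # touches rows 0 and 2 only
```

The inverted index turns a filter such as `country = 'US'` into a list of row numbers, and the fixed-width mmap layout lets the scan jump straight to those rows instead of reading the whole column.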
> >>
> >>
> >>
> >>
> >>
> >>
> >> At 2018-11-01 12:42:25, "Xiaoxiang Yu" <[email protected]> wrote:
> >> >Hi Gang, I am so glad to know that eBay has a solution for real-time
> >> >OLAP on Kylin. I have some small questions:
> >> >
> >> >
> >> >1.      Is it possible to use Yarn as the cluster manager for index tasks?
> >> >The coordinator process would set them up at a specified period. Yarn
> >> >would manage:
> >> >
> >> >a)       retry these task if some failed
> >> >
> >> >b)       resource allocation
> >> >
> >> >c)       log collection
> >> >
> >> >2.      As I know, eBay’s New Kylin Streaming Solution uses a replica set
> >> >to ensure that incoming messages won’t be lost if some processes are
> >> >lost. I think a replica set is a set of Kafka consumer processes that
> >> >are responsible for ingesting messages and building the base cuboid in
> >> >memory. Could you please show me some detail about how the replica set
> >> >provides the HA guarantee? How is it configured? A link / paper is OK.
> >> >I found one, but I don’t know if it has the same meaning as your
> >> >replica set.
> >> >
> >> >a)       [Mongodb replication](https://docs.mongodb.com/manual/replication/).
> >> >
> >> >3.      How to add or remove a node of a replica set in the production
> >> >env? How to monitor the health/pressure of the replica set cluster?
> >> >
> >> >4.      Are all measures supported in eBay’s New Kylin Streaming
> >> >Solution? What about count distinct (bitmap)?
> >> >
> >> >5.      It seems eBay’s New Kylin Streaming Solution uses a custom
> >> >columnar storage. Why not use a mature open source columnar storage
> >> >solution? Have you ever compared the performance of your custom columnar
> >> >storage to an open source columnar storage solution?
> >> >
> >> >
> >> >
> >> >----------------
> >> >Best wishes,
> >> >Xiaoxiang Yu
> >> >
> >> >
> >> >From: Ma Gang <[email protected]>
> >> >Reply-To: "[email protected]" <[email protected]>
> >> >Date: Tuesday, October 30, 2018 15:24
> >> >To: "[email protected]" <[email protected]>
> >> >Subject: [DISCUSS] New Kylin Streaming Solution From eBay
> >> >
> >> >Hi all,
> >> >
> >> >eBay Kylin team has developed a new Kylin streaming solution. The basic
> >> >idea is to build a streaming cluster that ingests data from a streaming
> >> >source (Kafka) and serves queries on real-time data. The data
> >> >preparation latency is milliseconds, which means the data is queryable
> >> >almost as soon as it is ingested. Attached is the architecture design doc.
> >> >We would like to contribute the feature to the community. Please let us
> >> >know if you have any concerns.
> >> >
> >> >Thanks,
> >> >Gang(Allen) Ma
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
> >--
> >Best regards,
> >
> >Shaofeng Shi 史少锋
>


-- 
Best regards,

Shaofeng Shi 史少锋
