Re: [DISCUSS] New Kylin Streaming Solution From eBay

Xiaoxiang Yu Thu, 01 Nov 2018 00:13:12 -0700

Thank you for your reply. Maybe I can help to improve your Kylin Streaming 
Solution in the future.



----------------
Best wishes,
Xiaoxiang Yu





On [DATE], "[NAME]" <[ADDRESS]> wrote:



    Thanks Xiaoxiang,

    Very good questions! Please see my comments, marked with [Gang]:





    1.      Is it possible to use YARN as the cluster manager for indexing tasks?
The coordinator process would start them at specified intervals.

    [Gang] I think it is possible, but in the current design the indexing task
is a long-running task that also provides query service. This keeps the whole
system simple and efficient, so I don't think we need to stop and start
indexing tasks from time to time. Using YARN to manage resources is possible,
but we would need to redesign the existing coordinator to make it easy to
deploy to YARN, Kubernetes, etc. I hope this can be done after the contribution
to the community.



    2.      As far as I know, eBay's New Kylin Streaming Solution uses a replica
set to ensure that incoming messages are not lost if some processes fail. I
think a replica set is a set of Kafka consumer processes that are responsible
for ingesting messages and building the base cuboid in memory. Could you please
share some details about how the replica set provides the HA guarantee, and how
to configure it? A link or paper is OK. I found one, but I don't know whether
it means the same thing as your replica set.





    [Gang] Yes, it is similar to MongoDB replication, but currently we don't
replicate data from a primary node. We just assign the same Kafka
topic/partitions to all receivers in a ReplicaSet, so every receiver in the
ReplicaSet consumes the same data from Kafka. If one receiver goes down, the
other receivers in the ReplicaSet are still consuming the same Kafka data, so
consumption and queries are not impacted. We don't guarantee that the receivers
in a ReplicaSet consume at the same rate, but we can guarantee that users see
consistent data by sticking the queries for one cube to one receiver.

    The HA implementation is a little naive, but it is simple and it works.
Maybe in the future we can do HA by replication, to support other streaming
sources that don't support multiple consumers and don't have a persistent store.
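
The sticky routing described above can be sketched roughly as follows. This is only an illustration of the idea, not the eBay implementation: the `ReplicaSet` class, the receiver names, and the hash-based choice of receiver are all assumptions made for the example.

```python
import zlib


class ReplicaSet:
    """Hypothetical model: every receiver consumes the same Kafka partitions."""

    def __init__(self, receivers):
        self.receivers = list(receivers)  # fixed order, used for routing
        self.alive = set(receivers)       # receivers currently reachable

    def mark_down(self, receiver):
        self.alive.discard(receiver)

    def route(self, cube):
        """Stick each cube to one receiver; move to the next live one on failure."""
        if not self.alive:
            raise RuntimeError("no live receiver in replica set")
        # Deterministic hash, so the same cube always starts at the same receiver.
        start = zlib.crc32(cube.encode("utf-8")) % len(self.receivers)
        for offset in range(len(self.receivers)):
            candidate = self.receivers[(start + offset) % len(self.receivers)]
            if candidate in self.alive:
                return candidate
```

Because all receivers consume the same partitions, any live receiver can answer a query. Routing one cube to one receiver only hides the fact that receivers may be at different offsets.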



    3.      How do you add or remove a node of a replica set in a production
environment? How do you monitor the health/load of the replica set cluster?

    [Gang] Currently we have a UI and a RESTful API that let an admin add a node
to or remove a node from a ReplicaSet, and a simple UI that lets an admin
monitor the health and consuming rate of each receiver/cube. All metrics are
also collected with the Yammer Metrics framework, so they are easy to expose to
other monitoring systems.



    4.      Are all measures supported in eBay's New Kylin Streaming Solution?
What about count distinct (bitmap)?

    [Gang] Most measures are supported, but precise count distinct (bitmap) is
not supported when the distinct dimension is not of int type. As you know,
supporting precise count distinct on a non-int dimension requires building a
global dictionary, which is not possible in the streaming environment.
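
A small sketch of why the dictionary has to be global. It uses a plain Python int as a stand-in for a real bitmap, and the helper names are hypothetical. Int values can index the bitmap directly, while strings encoded with independent per-segment dictionaries get colliding ids, so merged bitmaps undercount.

```python
def bitmap_of(ids):
    """Represent a set of non-negative ints as a Python int used as a bitmap."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits


def cardinality(bits):
    """Count the set bits, i.e. the number of distinct ids in the bitmap."""
    return bin(bits).count("1")


def encode_locally(values):
    """Assign ids in arrival order, as a node without a shared dictionary would."""
    dictionary = {}
    for v in values:
        dictionary.setdefault(v, len(dictionary))
    return bitmap_of(dictionary.values())


# Int dimension: the values themselves are bit positions, so bitmaps merge safely.
int_distinct = cardinality(bitmap_of([1, 2, 3]) | bitmap_of([3, 4]))

# String dimension with two independent dictionaries: both segments use ids 0 and 1,
# so the merged bitmap reports 2 distinct values although there are really 4.
str_distinct = cardinality(
    encode_locally(["alice", "bob"]) | encode_locally(["carol", "dave"])
)
```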





    5.      It seems eBay's New Kylin Streaming Solution uses a custom columnar
storage. Why not use a mature open source columnar storage solution? Have you
ever compared the performance of your custom columnar storage with an open
source columnar storage solution?



    [Gang] Most open source columnar formats, such as Parquet and ORC, are
designed for use in a Hadoop environment, whereas the streaming data is on
local disk, so I didn't consider them at the beginning. It is not very hard to
define a columnar format for Kylin-specific data. With a customized columnar
storage, you can use mmap'd files to scan data and add a row-level inverted
index for all dimensions, so I think the performance will be better than with a
common columnar format. I haven't compared the performance, but the storage
engine is pluggable, so you may contribute a Parquet storage if you are
interested.
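
To make the mmap plus inverted index idea concrete, here is a toy sketch. The file layout (one fixed-width little-endian int32 per row) and all function names are assumptions for illustration. They are not the actual Kylin streaming storage format.

```python
import mmap
import os
import struct
import tempfile
from collections import defaultdict


def write_column(path, values):
    """Store one column as fixed-width little-endian int32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%di" % len(values), *values))


def build_inverted_index(dim_values):
    """Map each dimension value to the row ids where it occurs."""
    index = defaultdict(list)
    for row_id, value in enumerate(dim_values):
        index[value].append(row_id)
    return index


def sum_measure(path, row_ids):
    """Read only the requested rows of a measure column through mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(struct.unpack_from("<i", mm, r * 4)[0] for r in row_ids)


def demo():
    """Filter on dimension value 10, then sum the matching measure rows."""
    dim = [10, 20, 10, 30]
    measure = [5, 7, 11, 13]
    index = build_inverted_index(dim)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_column(path, measure)
        return sum_measure(path, index[10])
    finally:
        os.remove(path)
```

The inverted index turns a filter on a dimension into a list of row ids, and the fixed-width layout lets the scan jump straight to those rows in the mapped file instead of reading the whole column.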













    At 2018-11-01 12:42:25, "Xiaoxiang Yu" <[email protected]> wrote:

    >Hi Gang, I am so glad to know that eBay has a solution for real-time OLAP
on Kylin. I have some small questions:

    >

    >

    >1.      Is it possible to use YARN as the cluster manager for indexing tasks?
The coordinator process would start them at specified intervals. YARN would
manage:

    >

    >a)       retrying these tasks if some fail

    >

    >b)       resource allocation

    >

    >c)       log collection

    >

    >2.      As far as I know, eBay's New Kylin Streaming Solution uses a replica
set to ensure that incoming messages are not lost if some processes fail. I
think a replica set is a set of Kafka consumer processes that are responsible
for ingesting messages and building the base cuboid in memory. Could you please
share some details about how the replica set provides the HA guarantee, and how
to configure it? A link or paper is OK. I found one, but I don't know whether
it means the same thing as your replica set.

    >

    >a)       [MongoDB replication](https://docs.mongodb.com/manual/replication/).

    >

    >3.      How do you add or remove a node of a replica set in a production
environment? How do you monitor the health/load of the replica set cluster?

    >

    >4.      Are all measures supported in eBay's New Kylin Streaming Solution?
What about count distinct (bitmap)?

    >

    >5.      It seems eBay's New Kylin Streaming Solution uses a custom columnar
storage. Why not use a mature open source columnar storage solution? Have you
ever compared the performance of your custom columnar storage with an open
source columnar storage solution?

    >

    >

    >

    >----------------

    >Best wishes,

    >Xiaoxiang Yu

    >

    >

    >From: Ma Gang <[email protected]>

    >Reply-To: "[email protected]" <[email protected]>

    >Date: Tuesday, October 30, 2018 15:24

    >To: "[email protected]" <[email protected]>

    >Subject: [DISCUSS] New Kylin Streaming Solution From eBay

    >

    >Hi all,

    >

    >The eBay Kylin team has developed a new Kylin streaming solution. The basic
idea is to build a streaming cluster that ingests data from a streaming source
(Kafka) and serves queries on real-time data. The data preparation latency is
in milliseconds, which means the data is queryable almost as soon as it is
ingested. Attached is the architecture design doc.

    >We would like to contribute this feature to the community. Please let us
know if you have any concerns.

    >

    >Thanks,

    >Gang(Allen) Ma

    >

    >

    >

    >

    >
