
Re: Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-11-03 Thread Billy Liu

Cool. It extends Kylin's scenarios to real-time query.

With warm regards,

Billy Liu


Re: Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-11-02 Thread ShaoFeng Shi

Hi Gang, I appreciate your hard work!


Re:Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-11-01 Thread Ma Gang

Hi ShaoFeng,
For streaming ingest/query performance, there is a doc:
https://drive.google.com/file/d/1GSBMpRuVQRmr8Ev2BWvssfMd-Rck9vsH/view?ths=true
It is also in the 'performance' section of the design doc attached to the JIRA:
https://issues.apache.org/jira/browse/KYLIN-3654
As for stability, it is very stable in our environment, but it is not yet
widely used at eBay, so it is hard to say.
I will start merging the code into the master branch. It may take some time
because our current version is Kylin 2.1.0. I hope it can be done before
Nov. 30, but I cannot guarantee it; there is a lot of other work to do.


Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-11-01 Thread Xiaoxiang Yu

Thank you for your reply. Maybe I can help to improve your Kylin Streaming 
Solution in the future.

Best wishes,
Xiaoxiang Yu

On [DATE], "[NAME]" <[ADDRESS]> wrote:

Thanks Xiaoxiang,

Very good questions! Please see my comments, which start with [Gang]:

1.  Is it possible to use Yarn as the cluster manager for the indexing tasks?
The coordinator process would set them up at a specified interval.

[Gang] I think it is possible, but in the current design the indexing task is
a long-running task that can also serve queries. This makes the whole system
very simple and efficient, so I don't think we need to stop and start the
indexing task from time to time. Using Yarn to manage the resources is
possible, but we would need to redesign the existing coordinator to make it
easy to deploy to Yarn, Kubernetes, etc. I hope this can be done after the
contribution to the community.

2.  As far as I know, eBay's New Kylin Streaming Solution uses a replica set
to ensure that incoming messages are not lost if some processes are lost. I
think a replica set is a set of Kafka consumer processes responsible for
ingesting messages and building the base cuboid in memory. Could you please
show me some details about how the replica set provides the HA guarantee? How
is it configured? A link or paper is OK. I found one, but I don't know if it
has the same meaning as your replica set.

[Gang] Yes, it is similar to MongoDB replication, but currently we don't
replicate data from a primary node. We just assign the same Kafka
topic/partitions to all receivers in a ReplicaSet, and all of them consume
data from Kafka. If one receiver is down, the other receivers in the
ReplicaSet are still consuming the same Kafka data, so consuming and querying
are not impacted. We don't guarantee that the receivers in a ReplicaSet
consume at the same rate, but we can guarantee that the user sees data
consistently by sticking the queries for one cube to one receiver.

The HA implementation is a little bit naive, but it is simple and it works.
Maybe in the future we can do HA by replication, to support other streaming
sources that don't support multiple consumers and don't have a persistent
store.
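
The sticky routing described in the answer above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `ReplicaSetRouter`, its methods, and the receiver names are hypothetical and are not taken from the Kylin code base. The sketch only models the query side; every receiver is assumed to consume the same Kafka partitions on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not Kylin APIs.
class ReplicaSetRouter {
    // Receivers of one ReplicaSet; all of them consume the same Kafka partitions.
    private final List<String> receivers;
    private final Map<String, Boolean> alive = new HashMap<>();
    // Remembers which receiver serves queries for each cube.
    private final Map<String, String> stickyReceiverByCube = new HashMap<>();

    ReplicaSetRouter(List<String> receivers) {
        this.receivers = new ArrayList<>(receivers);
        for (String receiver : receivers) {
            alive.put(receiver, Boolean.TRUE);
        }
    }

    void markDown(String receiver) {
        alive.put(receiver, Boolean.FALSE);
    }

    // Returns the receiver that should answer queries for the given cube.
    // The choice stays fixed while that receiver is alive, so a user does not
    // jump between receivers that may have consumed different amounts of data.
    String route(String cube) {
        String current = stickyReceiverByCube.get(cube);
        if (current != null && Boolean.TRUE.equals(alive.get(current))) {
            return current;
        }
        for (String receiver : receivers) {
            if (Boolean.TRUE.equals(alive.get(receiver))) {
                stickyReceiverByCube.put(cube, receiver);
                return receiver;
            }
        }
        throw new IllegalStateException("no live receiver in the ReplicaSet");
    }

    public static void main(String[] args) {
        ReplicaSetRouter router =
                new ReplicaSetRouter(Arrays.asList("receiver-1", "receiver-2"));
        System.out.println(router.route("cube_a")); // sticks to one receiver
        router.markDown("receiver-1");
        System.out.println(router.route("cube_a")); // fails over to a live one
    }
}
```

After a failover the new receiver may be slightly ahead of or behind the old one, which matches the statement that consuming rates are not guaranteed to be equal.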

3.  How do you add or remove a node of a replica set in a production
environment? How do you monitor the health/pressure of the replica set
cluster?

[Gang] Currently we have a UI and a RESTful API that let an admin add/remove a
node to/from a ReplicaSet, and a simple UI that lets an admin monitor the
health and consuming rate of each receiver/cube. Also, all metrics are
collected using the Yammer Metrics framework, so they are easy to expose to
other monitoring systems.

4.  Are all measures supported in eBay's New Kylin Streaming Solution? What
about count distinct (bitmap)?

[Gang] Most measures are supported, but precise count distinct (bitmap) is
not supported when the distinct dimension is not of int type. As you know,
supporting precise count distinct for a non-int dimension requires building a
global dictionary, which is not possible in the streaming environment.
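
To illustrate why non-int values need a global dictionary, here is a minimal sketch using `java.util.BitSet` as the bitmap. The class `BitmapCountDistinctSketch` and its methods are hypothetical, and the code is not how Kylin encodes bitmaps; it only shows that bitmaps built with separate local dictionaries cannot be merged correctly, while bitmaps built with one shared dictionary can.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not Kylin APIs.
class BitmapCountDistinctSketch {
    // Encodes values as a bitmap, assigning the next free id to unseen values.
    static BitSet encode(String[] values, Map<String, Integer> dictionary) {
        BitSet bitmap = new BitSet();
        for (String value : values) {
            Integer id = dictionary.get(value);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(value, id);
            }
            bitmap.set(id);
        }
        return bitmap;
    }

    // Merges two bitmaps and returns the distinct count of the union.
    static int mergedCount(BitSet a, BitSet b) {
        BitSet union = (BitSet) a.clone();
        union.or(b);
        return union.cardinality();
    }

    public static void main(String[] args) {
        String[] segment1 = {"alice", "bob"};
        String[] segment2 = {"carol", "bob"};

        // One dictionary shared by both segments: ids agree, the merge is exact.
        Map<String, Integer> shared = new HashMap<>();
        int withGlobal = mergedCount(encode(segment1, shared), encode(segment2, shared));

        // One dictionary per segment: "alice" and "carol" both get id 0,
        // so the merged bitmap undercounts.
        int withLocal = mergedCount(
                encode(segment1, new HashMap<>()), encode(segment2, new HashMap<>()));

        System.out.println("global=" + withGlobal + ", local=" + withLocal);
    }
}
```

Int-typed dimensions avoid the problem because the value itself can serve as the bit position, so no dictionary has to be agreed on across receivers or segments.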

5.  It seems eBay's New Kylin Streaming Solution uses a custom columnar
storage. Why not use a mature open source columnar storage solution? Have you
ever compared the performance of your custom columnar storage with that of an
open source columnar storage solution?

[Gang] Most open source columnar formats, like Parquet and ORC, are designed
for use in a Hadoop environment, while the streaming data is on local disk, so
I didn't consider them at the beginning. It is not very hard to define a
columnar format to store Kylin-specific data. With a customized columnar
storage, you can use mmap files to scan data and add a row-level inverted
index for all dimensions, so I think the performance will be better than with
a common columnar format. I didn't compare the performance, but the storage
engine is pluggable; you may contribute a Parquet storage if you are
interested.
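
As a rough illustration of the row-level inverted index mentioned above, the sketch below keeps, for each dimension value, a bitmap of the row ids that contain it, and answers an equality filter by intersecting bitmaps. It is an in-memory toy under stated assumptions: `InvertedIndexSketch` and its methods are hypothetical, and the mmap-backed file layout of the real storage is not modeled.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not Kylin APIs.
class InvertedIndexSketch {
    // dimension name -> (dimension value -> ids of rows holding that value)
    private final Map<String, Map<String, BitSet>> index = new HashMap<>();
    private int rowCount = 0;

    // Small helper to build a row from alternating key/value strings.
    static Map<String, String> dims(String... keyValues) {
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            row.put(keyValues[i], keyValues[i + 1]);
        }
        return row;
    }

    // Appends a row and records its id under every dimension value it has.
    void addRow(Map<String, String> row) {
        int rowId = rowCount++;
        for (Map.Entry<String, String> e : row.entrySet()) {
            index.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                 .computeIfAbsent(e.getValue(), v -> new BitSet())
                 .set(rowId);
        }
    }

    // Returns the ids of rows that match every "dimension = value" condition.
    BitSet filter(Map<String, String> conditions) {
        BitSet result = new BitSet();
        result.set(0, rowCount); // start with all rows
        for (Map.Entry<String, String> e : conditions.entrySet()) {
            Map<String, BitSet> byValue = index.get(e.getKey());
            BitSet rows = byValue == null ? null : byValue.get(e.getValue());
            if (rows == null) {
                return new BitSet(); // value never seen: nothing matches
            }
            result.and(rows);
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addRow(dims("site", "US", "category", "A"));
        idx.addRow(dims("site", "US", "category", "B"));
        idx.addRow(dims("site", "UK", "category", "A"));
        System.out.println(idx.filter(dims("site", "US", "category", "A")));
    }
}
```

With such an index a scan only needs to touch the rows whose ids survive the intersection, which is the kind of saving a general-purpose columnar format does not provide out of the box.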


Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-11-01 Thread ShaoFeng Shi

Hi Gang,

Thank you for the information; it is helpful for understanding the
overall design and implementation.

Do you have some statistical information, like performance, throughput,
stability, etc.? Besides, what's the plan for contributing it to the
community? Thanks!



Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-10-31 Thread Xiaoxiang Yu

Hi Gang, I am so glad to know that eBay has a solution for real-time OLAP on
Kylin. I have some small questions:


1.  Is it possible to use Yarn as the cluster manager for the indexing tasks?
The coordinator process would set them up at a specified interval. Yarn would
manage:

a)   retrying tasks if some fail

b)   resource allocation

c)   log collection

2.  As far as I know, eBay's New Kylin Streaming Solution uses a replica set
to ensure that incoming messages are not lost if some processes are lost. I
think a replica set is a set of Kafka consumer processes responsible for
ingesting messages and building the base cuboid in memory. Could you please
show me some details about how the replica set provides the HA guarantee? How
is it configured? A link or paper is OK. I found one, but I don't know if it
has the same meaning as your replica set.

a)   [MongoDB replication](https://docs.mongodb.com/manual/replication/).

3.  How do you add or remove a node of a replica set in a production
environment? How do you monitor the health/pressure of the replica set
cluster?

4.  Are all measures supported in eBay's New Kylin Streaming Solution? What
about count distinct (bitmap)?

5.  It seems eBay's New Kylin Streaming Solution uses a custom columnar
storage. Why not use a mature open source columnar storage solution? Have you
ever compared the performance of your custom columnar storage with that of an
open source columnar storage solution?




Best wishes,
Xiaoxiang Yu


From: Ma Gang
Reply-To: "dev@kylin.apache.org"
Date: Tuesday, October 30, 2018, 15:24
To: "dev@kylin.apache.org"
Subject: [DISCUSS] New Kylin Streaming Solution From eBay

Hi all,

The eBay Kylin team has developed a new Kylin streaming solution. The basic
idea is to build a streaming cluster that ingests data from a streaming source
(Kafka) and serves queries on real-time data. The data preparation latency is
in milliseconds, which means the data is queryable almost as soon as it is
ingested. Attached is the architecture design doc.
We would like to contribute the feature to the community; please let us know
if you have any concerns.

Thanks,
Gang(Allen) Ma

Re:Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-10-31 Thread Ma Gang

Hi ShaoFeng,
Sorry, I replied to the wrong email; I am copying my reply here for further
discussion :)
Very good questions, please see my comments, which start with [Gang]:

1) How do we bridge the real-time cube with a cube built from Hive? You know,
in Kylin the source type is marked at the table level, which means a table
is either a Hive table, a JDBC table or a streaming table. To implement
the lambda architecture, how do we combine the batch cube with the real-time
cube (on the same table)? This does not seem to be mentioned in the design doc.

[Gang] >> There is a sourceType field in TableDesc that indicates the source
type. I just added new types for tables that have more than one source; for
example, ID_KAFKA_HIVE=21 means the table source can be both Kafka and Hive.
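
A minimal sketch of how such a combined source type could be interpreted is shown below. Only `ID_KAFKA_HIVE = 21` comes from this thread; the other constant values, the class `SourceTypeSketch`, and the helper methods are assumptions made for illustration and may not match the Kylin code.

```java
// Hypothetical sketch: only ID_KAFKA_HIVE = 21 is taken from the thread.
class SourceTypeSketch {
    static final int ID_HIVE = 0;        // assumed value, for illustration only
    static final int ID_KAFKA = 20;      // assumed value, for illustration only
    static final int ID_KAFKA_HIVE = 21; // table fed by both Kafka and Hive

    // True if the table can be consumed by the streaming receivers.
    static boolean hasStreamingSource(int sourceType) {
        return sourceType == ID_KAFKA || sourceType == ID_KAFKA_HIVE;
    }

    // True if the table can also be built by a batch job from Hive.
    static boolean hasBatchSource(int sourceType) {
        return sourceType == ID_HIVE || sourceType == ID_KAFKA_HIVE;
    }

    public static void main(String[] args) {
        int sourceType = ID_KAFKA_HIVE;
        System.out.println("streaming=" + hasStreamingSource(sourceType)
                + ", batch=" + hasBatchSource(sourceType));
    }
}
```

A table marked this way is the one case where both checks hold, which is what a lambda-style setup on a single table needs.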

2) How does it work together with the as-is NRT (near real-time) solution
introduced in v1.6? Many users are building cubes directly from Kafka,
though in mini or micro batches. Can the new streaming
solution work together with the NRT cube? E.g., if I don't need to do ETL in
Hive, can I use the batch job to fetch data from Kafka, and use
the streaming real-time receivers together?

[Gang] >> The new streaming solution is totally new and works separately from
the current streaming solution. There is no conflict with the NRT solution, so
they can run side by side on the same Kylin platform, but currently they
cannot work together in the way you describe.

3) Does the "Build engine" of the real-time solution follow the plug-in
architecture, so that it can support non-HBase storage? As you know, we're
implementing the Parquet storage. Can this solution support other storages
without much rework?

[Gang] >> Yes, the "Build engine" follows the plug-in architecture, so it is
easy to support non-HBase storage. At eBay we only use InMemCubing, so
currently only the InMemCubing algorithm is implemented, but I think it is
easy to extend it to support LayerCubing.





Re: Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-10-31 Thread ShaoFeng Shi

Hi Gang,

The real-time OLAP capability is pretty cool; I have a couple of questions
here:

1) How do we bridge the real-time cube with a cube built from Hive? You know,
in Kylin the source type is marked at the table level, which means a table
is either a Hive table, a JDBC table or a streaming table. To implement
the lambda architecture, how do we combine the batch cube with the real-time
cube (on the same table)? This does not seem to be mentioned in the design doc.
2) How does it work together with the as-is NRT (near real-time) solution
introduced in v1.6? Many users are building cubes directly from Kafka,
though in mini or micro batches. Can the new streaming
solution work together with the NRT cube? E.g., if I don't need to do ETL in
Hive, can I use the batch job to fetch data from Kafka, and use
the streaming real-time receivers together?
3) Does the "Build engine" of the real-time solution follow the plug-in
architecture, so that it can support non-HBase storage? As you know, we're
implementing the Parquet storage. Can this solution support other storages
without much rework?

Thanks for raising this discussion.

On Wed, Oct 31, 2018 at 9:57 AM, Ma Gang wrote:

> A Jira ticket has been created, and the related design doc is attached to
> the ticket: https://issues.apache.org/jira/browse/KYLIN-3654
>
>
> At 2018-10-30 21:40:34, "ShaoFeng Shi" wrote:
> >Hi Gang,
> >
> >The design doc is still missing; can you upload it somewhere and then
> >provide a link?
> >
> >On Tue, Oct 30, 2018 at 8:35 PM, Ma Gang wrote:
> >
> >> Resending the design doc; not sure why the attachment was removed from the
> >> previous mail.
> >>
> >> At 2018-10-30 15:24:01, "Ma Gang"  wrote:
> >>
> >> Hi all,
> >>
> >> The eBay Kylin team has developed a new Kylin streaming solution. The basic
> >> idea is to build a streaming cluster that ingests data from a streaming
> >> source (Kafka) and serves queries on real-time data. The data preparation
> >> latency is in milliseconds, which means the data is queryable almost as soon
> >> as it is ingested. Attached is the architecture design doc.
> >> We would like to contribute the feature to the community; please let us know
> >> if you have any concerns.
> >>
> >> Thanks,
> >> Gang(Allen) Ma
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >--
> >Best regards,
> >
> >Shaofeng Shi 史少锋
>

-- 
Best regards,

Shaofeng Shi 史少锋

Re: [DISCUSS] New Kylin Streaming Solution From eBay

2018-10-30 Thread ShaoFeng Shi

Hi Gang,

The design doc is still missing; can you upload it somewhere and then
provide a link?

On Tue, Oct 30, 2018 at 8:35 PM, Ma Gang wrote:

> Resending the design doc; not sure why the attachment was removed from the
> previous mail.
>
> At 2018-10-30 15:24:01, "Ma Gang"  wrote:
>
> Hi all,
>
> The eBay Kylin team has developed a new Kylin streaming solution. The basic
> idea is to build a streaming cluster that ingests data from a streaming
> source (Kafka) and serves queries on real-time data. The data preparation
> latency is in milliseconds, which means the data is queryable almost as soon
> as it is ingested. Attached is the architecture design doc.
> We would like to contribute the feature to the community; please let us know
> if you have any concerns.
>
> Thanks,
> Gang(Allen) Ma
>
>
>
>
>
>
>
>


-- 
Best regards,

Shaofeng Shi 史少锋

Re:[DISCUSS] New Kylin Streaming Solution From eBay

2018-10-30 Thread Ma Gang

Resending the design doc; not sure why the attachment was removed from the
previous mail.


At 2018-10-30 15:24:01, "Ma Gang"  wrote:

Hi all,


The eBay Kylin team has developed a new Kylin streaming solution. The basic idea
is to build a streaming cluster that ingests data from a streaming source (Kafka)
and serves queries on real-time data. The data preparation latency is in
milliseconds, which means the data is queryable almost as soon as it is ingested.
Attached is the architecture design doc.
We would like to contribute the feature to the community; please let us know if
you have any concerns.


Thanks,
Gang(Allen) Ma
