Thanks Xiaoxiang,
Very good questions! Please see my comments, marked with [Gang]:
1. Is it possible to use Yarn as the cluster manager for index tasks?
The coordinator process would set them up at a specified interval.
[Gang] I think it is possible, but in the current design the indexing task is
a long-running task that also serves queries. This keeps the whole system
simple and efficient, so I don't think we need to stop and start indexing
tasks from time to time. Using Yarn to manage resources is possible, but we
would need to redesign the existing coordinator to make it easy to deploy to
Yarn, Kubernetes, etc. I hope this can be done after the contribution to the
community.
2. As far as I know, eBay’s New Kylin Streaming Solution uses replica sets to
ensure that incoming messages are not lost if some processes are lost. I think
a replica set is a set of Kafka consumer processes responsible for ingesting
messages and building the base cuboid in memory. Could you please show me some
detail about how a replica set provides the HA guarantee? How is it configured?
A link or paper is OK. I found one, but I don’t know whether it means the same
thing as your replica set.
[Gang] Yes, it is similar to MongoDB replication, but currently we don't
replicate data from a primary node. We just assign the same Kafka
topic/partitions to all receivers in a ReplicaSet, and every receiver consumes
the data from Kafka independently. If one receiver goes down, the other
receivers in the ReplicaSet are still consuming the same Kafka data, so
consuming and querying are not impacted. We don't guarantee that the receivers
in a ReplicaSet consume at the same rate, but we can guarantee that the user
sees consistent data by sticking the queries for one cube to one receiver.
The HA implementation is a little bit naive, but it is simple and it works.
Maybe in the future we can do HA by replication, to support other streaming
sources that don't support multiple consumers and don't have a persistent
store.
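A rough sketch of the idea described above (illustrative only, not the actual
Kylin code; the class names, the offset field, and the hash-based routing are
all assumptions made for the example):

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Receiver:
    name: str
    alive: bool = True
    offset: int = 0  # consuming progress; may differ between receivers


@dataclass
class ReplicaSet:
    receivers: list
    partitions: list = field(default_factory=list)

    def assign(self, partitions):
        # Every receiver in the set reads the SAME partitions from Kafka.
        # Nothing is copied from one receiver to another.
        self.partitions = list(partitions)

    def pick_receiver(self, cube):
        # Sticky routing: a cube always maps to the same preferred receiver,
        # so a user does not see data jump between receivers that have
        # consumed to different offsets. If the preferred receiver is down,
        # fall through to the next live one.
        start = zlib.crc32(cube.encode("utf-8")) % len(self.receivers)
        for i in range(len(self.receivers)):
            candidate = self.receivers[(start + i) % len(self.receivers)]
            if candidate.alive:
                return candidate
        raise RuntimeError("no live receiver in replica set")


rs = ReplicaSet([Receiver("receiver-1"), Receiver("receiver-2")])
rs.assign(["events-0", "events-1"])
first = rs.pick_receiver("sales_cube")
```

As long as the chosen receiver stays up, repeated queries for the same cube
land on the same receiver; only a failure moves them to another member of the
set.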
3. How do you add or remove a node of a replica set in a production
environment? How do you monitor the health/pressure of the replica set cluster?
[Gang] Currently we have a UI and a RESTful API that let an admin add/remove
nodes to/from a ReplicaSet, and a simple UI that lets an admin monitor the
health and consuming rate of each receiver/cube. All metrics are also collected
with the Yammer Metrics framework, so they are easy to expose to other
monitoring systems.
4. Are all measures supported in eBay’s New Kylin Streaming Solution?
What about count distinct (bitmap)?
[Gang] Most measures are supported, but precise count distinct (bitmap) is not
supported when the distinct dimension is not of int type. As you know,
supporting precise count distinct on a non-int dimension requires building a
global dictionary, which is not possible in the streaming environment.
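To illustrate why the dictionary has to be global, here is a small sketch. It
is a simplification, not Kylin's implementation: a Python int stands in for the
bitmap, and the helper names and sample data are made up. With a separate
dictionary per segment, the same id can mean different strings, so OR-ing the
bitmaps undercounts.

```python
def build_dict(values):
    """Assign ids in first-seen order, as a simple dictionary encoder might."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids))
    return ids


def to_bitmap(values, dictionary):
    """Bitmap held in a Python int: bit i is set when id i is present."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << dictionary[v]
    return bitmap


def cardinality(bitmap):
    return bin(bitmap).count("1")


seg1 = ["alice", "bob"]
seg2 = ["carol", "alice", "dave"]

# Per-segment dictionaries: id 0 is "alice" in seg1 but "carol" in seg2.
local_count = cardinality(
    to_bitmap(seg1, build_dict(seg1)) | to_bitmap(seg2, build_dict(seg2))
)

# One dictionary shared by all segments gives consistent ids.
global_dict = build_dict(seg1 + seg2)
global_count = cardinality(
    to_bitmap(seg1, global_dict) | to_bitmap(seg2, global_dict)
)

# Int values need no dictionary: the value itself is the bit position.
ints_a, ints_b = [7, 42], [42, 99]
identity = {v: v for v in ints_a + ints_b}
int_count = cardinality(to_bitmap(ints_a, identity) | to_bitmap(ints_b, identity))
```

Here the true number of distinct names is 4; `local_count` comes out as 3 while
`global_count` is 4, and the int case is exact without any dictionary.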
5. It seems eBay’s New Kylin Streaming Solution uses a custom columnar
storage. Why not use a mature open source columnar storage solution? Have
you ever compared the performance of your custom columnar storage with an open
source columnar storage solution?
[Gang] Most open source columnar formats, like Parquet and ORC, are designed
for use in a Hadoop environment, while the streaming data is on local disk, so
I didn't consider them at the beginning. It is not very hard to define a
columnar format for Kylin-specific data, and with a customized columnar storage
you can use mmap'ed files to scan data and add a row-level inverted index for
all dimensions, so I think the performance will be better than with a common
columnar format. I haven't compared the performance, but the storage engine is
pluggable, so you may contribute a Parquet storage if you are interested.
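A toy sketch of the mmap plus inverted index idea (this is not the real Kylin
storage format; the fixed-width int32 layout, file names, and function names
are assumptions chosen to keep the example short):

```python
import mmap
import os
import struct
import tempfile
from collections import defaultdict

ROW = struct.Struct("<i")  # one little-endian int32 per row


def write_column(path, values):
    """Write one column as fixed-width int32 values, one per row."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ROW.pack(v))


def build_inverted_index(ids):
    """Map each dimension value id to the row numbers that contain it."""
    index = defaultdict(list)
    for row, value_id in enumerate(ids):
        index[value_id].append(row)
    return dict(index)


def read_rows(path, rows):
    """Read only the requested rows through mmap, not the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return [ROW.unpack_from(m, r * ROW.size)[0] for r in rows]


tmp = tempfile.mkdtemp()
country_path = os.path.join(tmp, "country.col")  # dimension, dictionary ids
amount_path = os.path.join(tmp, "amount.col")    # measure

country_ids = [0, 1, 0, 2, 1]
amounts = [10, 20, 30, 40, 50]
write_column(country_path, country_ids)
write_column(amount_path, amounts)

# Filter "country == 1" via the index, then touch only the matching rows.
index = build_inverted_index(country_ids)
matching = index.get(1, [])
total = sum(read_rows(amount_path, matching))
```

Because each column is fixed-width, a row number translates directly to a byte
offset, so a filter on a dimension only reads the measure rows that the index
points to.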
At 2018-11-01 12:42:25, "Xiaoxiang Yu" wrote:
>Hi gang, I am so glad to know that eBay has a solution for realtime olap on
>kylin. I have some small question:
>
>
>1. Is it possible to use Yarn as cluster manager for index task.
>Coordinator process will set up them at specificed period. Yarn will manage :
>
>a) retry these task if some failed
>
>b) resource allocation
>
>c) log collection
>
>2. As I know, ebay’s New Kylin Streaming Solution use replica Set to
>ensure that income messages wouldn’t lost if some processes lost. I think
>replica set is a set of kafka cosumer processes which is responsible for
>ingest message and build base cuboid in memory. Could you please show me some
>detail about how replica Set provide HA guarantee? How to configure it? A link
>/ paper is OK. I found one but I don’t know if it same meaning for your
>replica Set.
>
>a) [Mongodb replication](https://docs.mongodb.com/manual/replication/).
>
>3. How to add or remove node of replica Set in production env? How to
>monitor the health/pressure of replica Set cluster ?
>
>4. Does all measure are supported in ebay’s New Kylin Streaming Solution?
>What about count distinct(bitmap)?
>
>5. It seems ebay’s New Kylin Streaming Solution use a custom column