Hello Hemant,

Thanks for your reply.

I think the partitioning you described (event type + protocol type) is subject
to data skew. Including a device ID in the key should solve this problem.
Also, including "protocol_type" in the key while also having a topic per
protocol_type seems redundant.
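
For illustration, here is a minimal producer-side sketch (with a hypothetical
"telemetry" topic and plain string payloads, not taken from your setup) of
keying records by device ID, so Kafka's default hash partitioner spreads
devices across partitions instead of concentrating hot event types:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DeviceKeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String deviceId = "device-42"; // hypothetical device
                String payload = "{\"event_type\":\"metric1\",\"value\":3.14}";
                // Keying by device ID distributes load per device rather than
                // per (event_type + protocol_type), which is where the skew comes from.
                producer.send(new ProducerRecord<>("telemetry", deviceId, payload));
            }
        }
    }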

Furthermore, do you have any particular reason to maintain multiple topics?
I could imagine that the protocols have different speeds or other
characteristics, so that you can tune Flink accordingly.
Otherwise, having a single topic partitioned only by device ID would
simplify deployment and reduce data skew.
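
And a sketch of the consuming side under that single-topic layout (again with
the hypothetical "telemetry" topic, string payloads, and a made-up
extractDeviceId helper; adapt it to your actual schema):

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class SingleTopicDeviceJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // assumption
            props.setProperty("group.id", "device-metrics");          // hypothetical group

            // One source for all protocols/events; no per-protocol topics to manage.
            DataStream<String> raw = env.addSource(
                    new FlinkKafkaConsumer<>("telemetry", new SimpleStringSchema(), props));

            // Key by device ID so downstream joins/windows operate per device,
            // independent of event type or protocol type.
            raw.keyBy(SingleTopicDeviceJob::extractDeviceId)
               .print();

            env.execute("single-topic-device-keyed");
        }

        // Hypothetical helper: pull the device ID out of the record payload.
        private static String extractDeviceId(String record) {
            return record.split(",")[0];
        }
    }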

> By consume do you mean the downstream system?
Yes.

Regards,
Roman


On Mon, May 11, 2020 at 11:30 PM hemant singh <hemant2...@gmail.com> wrote:

> Hello Roman,
>
> PFB my response -
>
> As I understand, each protocol has a distinct set of event types (where
> event type == metrics type); and a distinct set of devices. Is this correct?
> Yes, correct. Distinct events and devices. Each device emits these events.
>
> > Based on the data protocol I have 4-5 topics. Currently the data for a
> single event is being pushed to a partition of the Kafka topic (producer key
> -> event_type + data_protocol).
> Here you are talking about the source (to Flink job), right?
> Yes, you are right.
>
> Can you also share how you are going to consume this data?
> By consume do you mean the downstream system?
> If yes, then this data will be written to a DB; some metrics go to a
> TSDB (Influx) as well.
>
> Thanks,
> Hemant
>
> On Tue, May 12, 2020 at 2:08 AM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
>> Hi Hemant,
>>
>> As I understand, each protocol has a distinct set of event types (where
>> event type == metrics type); and a distinct set of devices. Is this correct?
>>
>> > Based on the data protocol I have 4-5 topics. Currently the data for a
>> single event is being pushed to a partition of the Kafka topic (producer key
>> -> event_type + data_protocol).
>> Here you are talking about the source (to Flink job), right?
>>
>> Can you also share how you are going to consume this data?
>>
>>
>> Regards,
>> Roman
>>
>>
>> On Mon, May 11, 2020 at 8:57 PM hemant singh <hemant2...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have different events from a device which constitute different metrics
>>> for the same device. Each of these events is produced by the device at an
>>> interval of a few milliseconds to a minute.
>>>
>>> Event1(Device1) -> Stream1 -> Metric 1
>>> Event2 (Device1) -> Stream2 -> Metric 2 ...
>>> ..............
>>> .......
>>> Event100(Device1) -> Stream100 -> Metric100
>>>
>>> The number of events can go up to a few hundred for each data protocol, and
>>> we have around 4-5 data protocols. Metrics from different streams make up a
>>> record, for example, for device 1 above -
>>>
>>> Device1 -> Metric1, Metric2, Metric15 forms a single record for the
>>> device. Currently, in the development phase, I am using an interval join to
>>> achieve this, that is, to create a record with the latest data from the
>>> different streams (events).
>>>
>>> Based on the data protocol I have 4-5 topics. Currently the data for a
>>> single event is being pushed to a partition of the Kafka topic (producer key
>>> -> event_type + data_protocol). So essentially one topic is made up of many
>>> streams. I am filtering on the key to define the streams.
>>>
>>> My question is - is this the correct way to stream the data? I had thought
>>> of maintaining a different topic per event; however, in that case the number
>>> of topics could go to a few thousand, which becomes a little challenging to
>>> maintain, and I am not sure if Kafka handles that well.
>>>
>>> I know there are traditional ways to do this, like pushing it to a
>>> time-series DB and then joining the data for the different metrics, but that
>>> is something which will never scale; also, this processing should be as
>>> real-time as possible.
>>>
>>> Are there better ways to handle this use case, or am I on the correct path?
>>>
>>> Thanks,
>>> Hemant
>>>
>>
