Hello Hemant,

Thanks for your reply.

I think the partitioning you described (event type + protocol type) is
subject to data skew: there are only a few hundred such key combinations,
and their rates differ widely (every few milliseconds up to once a
minute), so some partitions would receive much more data than others.
Including a device ID in the key should solve this problem.
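To illustrate the keying I mean, here is a minimal sketch, assuming a
plain Java producer; the topic name "device-events", the device ID, and
the payload are made up for the example:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DeviceKeyedProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String deviceId = "device-42"; // made-up device ID
            String payload = "{\"event_type\":\"metric1\",\"value\":0.7}"; // made-up payload
            // Key by device ID only: the default partitioner hashes the key,
            // so each device maps to a stable partition, and all events of
            // one device stay ordered within that partition.
            producer.send(new ProducerRecord<>("device-events", deviceId, payload));
        }
    }
}

Since there are presumably many more devices than event types, hashing
the device ID spreads the load far more evenly across partitions.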
Also, including "protocol_type" in the key while at the same time having
a topic per protocol_type seems redundant. Furthermore, do you have any
particular reason to maintain multiple topics? I could imagine the
protocols have different speeds or other characteristics, so that you can
tune Flink accordingly. Otherwise, having a single topic partitioned only
by device ID would simplify deployment and reduce data skew.

> By consume do you mean the downstream system?
Yes.

Regards,
Roman

On Mon, May 11, 2020 at 11:30 PM hemant singh <hemant2...@gmail.com> wrote:

> Hello Roman,
>
> PFB my response -
>
> As I understand, each protocol has a distinct set of event types (where
> event type == metrics type); and a distinct set of devices. Is this
> correct?
> Yes, correct. Distinct events and devices. Each device emits these
> events.
>
> > Based on data protocol I have 4-5 topics. Currently the data for a
> > single event is being pushed to a partition of the kafka topic
> > (producer key -> event_type + data_protocol).
> Here you are talking about the source (to the Flink job), right?
> Yes, you are right.
>
> Can you also share how you are going to consume these data?
> By consume do you mean the downstream system?
> If yes, then this data will be written to a DB, and some metrics go to a
> TSDB (Influx) as well.
>
> Thanks,
> Hemant
>
> On Tue, May 12, 2020 at 2:08 AM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
>> Hi Hemant,
>>
>> As I understand, each protocol has a distinct set of event types (where
>> event type == metrics type); and a distinct set of devices. Is this
>> correct?
>>
>> > Based on data protocol I have 4-5 topics. Currently the data for a
>> > single event is being pushed to a partition of the kafka topic
>> > (producer key -> event_type + data_protocol).
>> Here you are talking about the source (to the Flink job), right?
>>
>> Can you also share how you are going to consume these data?
>>
>> Regards,
>> Roman
>>
>> On Mon, May 11, 2020 at 8:57 PM hemant singh <hemant2...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have different events from a device which constitute different
>>> metrics for the same device. Each of these events is produced by the
>>> device at an interval of a few milliseconds to a minute.
>>>
>>> Event1 (Device1) -> Stream1 -> Metric1
>>> Event2 (Device1) -> Stream2 -> Metric2
>>> ...
>>> Event100 (Device1) -> Stream100 -> Metric100
>>>
>>> The number of events can go up to a few hundred for each data
>>> protocol, and we have around 4-5 data protocols. Metrics from different
>>> streams make up a record; for example, for Device1 above:
>>>
>>> Device1 -> Metric1, Metric2, Metric15 form a single record for the
>>> device. Currently, in the development phase, I am using an interval
>>> join to achieve this, that is, to create a record with the latest data
>>> from the different streams (events).
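>>>
>>> Roughly, the join looks like the sketch below. This is a trimmed-down
>>> illustration, not my actual code: the Event type, field names, sample
>>> values, and time bounds are simplified stand-ins.
>>>
>>> import org.apache.flink.streaming.api.TimeCharacteristic;
>>> import org.apache.flink.streaming.api.datastream.DataStream;
>>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>> import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
>>> import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
>>> import org.apache.flink.streaming.api.windowing.time.Time;
>>> import org.apache.flink.util.Collector;
>>>
>>> public class MetricJoinSketch {
>>>
>>>     // Simplified stand-in for the real event type.
>>>     public static class Event {
>>>         public String deviceId;
>>>         public String metricName;
>>>         public double value;
>>>         public long timestamp;
>>>     }
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>>         env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
>>>
>>>         // In the real job these come from the Kafka source, split by
>>>         // filtering on the producer key.
>>>         DataStream<Event> metric1 = env
>>>                 .fromElements(event("device-1", "metric1", 0.7, 1000L))
>>>                 .assignTimestampsAndWatermarks(bounded());
>>>         DataStream<Event> metric2 = env
>>>                 .fromElements(event("device-1", "metric2", 0.4, 1500L))
>>>                 .assignTimestampsAndWatermarks(bounded());
>>>
>>>         metric1.keyBy(e -> e.deviceId)
>>>                 .intervalJoin(metric2.keyBy(e -> e.deviceId))
>>>                 // How far apart in event time two metrics of the same
>>>                 // device may be and still form one record.
>>>                 .between(Time.seconds(-30), Time.seconds(30))
>>>                 .process(new ProcessJoinFunction<Event, Event, String>() {
>>>                     @Override
>>>                     public void processElement(Event left, Event right, Context ctx,
>>>                             Collector<String> out) {
>>>                         // Combine the latest metrics of one device into a record.
>>>                         out.collect(left.deviceId + " -> " + left.metricName + "="
>>>                                 + left.value + ", " + right.metricName + "=" + right.value);
>>>                     }
>>>                 })
>>>                 .print();
>>>
>>>         env.execute("interval-join-sketch");
>>>     }
>>>
>>>     private static Event event(String deviceId, String metricName, double value, long ts) {
>>>         Event e = new Event();
>>>         e.deviceId = deviceId;
>>>         e.metricName = metricName;
>>>         e.value = value;
>>>         e.timestamp = ts;
>>>         return e;
>>>     }
>>>
>>>     private static BoundedOutOfOrdernessTimestampExtractor<Event> bounded() {
>>>         return new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
>>>             @Override
>>>             public long extractTimestamp(Event e) {
>>>                 return e.timestamp;
>>>             }
>>>         };
>>>     }
>>> }
>>>
>>> The between() bounds here are just an example of the tolerated
>>> event-time gap; in the real job they depend on how often each event
>>> type arrives.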
>>>
>>> Based on data protocol I have 4-5 topics. Currently the data for a
>>> single event is being pushed to a partition of the kafka topic
>>> (producer key -> event_type + data_protocol). So essentially one topic
>>> is made up of many streams, and I am filtering on the key to define
>>> the streams.
>>>
>>> My question is - is this the correct way to stream the data? I had
>>> thought of maintaining a different topic per event; however, in that
>>> case the number of topics could go up to a few thousand, and that is
>>> something which becomes a little challenging to maintain, and I am not
>>> sure if Kafka handles that well.
>>>
>>> I know there are traditional ways to do this, like pushing the data to
>>> a timeseries DB and then joining the data for the different metrics,
>>> but that will never scale; also, this processing should be as close to
>>> real time as possible.
>>>
>>> Are there better ways to handle this use case, or am I on the correct
>>> path?
>>>
>>> Thanks,
>>> Hemant