Hi,
We are new to Flink and would like some ideas on how to achieve parallelism 
and maintain state in Flink for a few interesting problems.
We receive files that contain multiple types of events. We need to process them 
in the following way:

1. Multiple event sources send event files to our system.
2. Files from different event sources should be processed in parallel.
3. Each file contains events of one or more event types. We want to process 
each event file as follows:
3.1. Group all events of the same event type (e.g. EvTypeA, EvTypeB, EvTypeC, 
etc.).
3.2. Sequence the event groups in a predefined order 
(EvTypeA-->EvTypeB-->EvTypeC) and process them in that order.
3.3. Based on an external configuration, the events in each event type group 
should be processed either sequentially (ordered by the event time specified 
in each event) or in parallel. Each event type is processed by a different 
Java class.
3.4. After all events of one event type group are processed, process the 
events of the next event type group, and so on. It is important to note that 
all events in one event type group must be processed before any event in the 
next group is processed. As mentioned in 3.3, the events within a group can be 
processed sequentially or in parallel based on a configuration.
3.5. Different files may contain different event types and a different number 
of event types. E.g. one file may contain EvTypeA+EvTypeB and another file may 
contain EvTypeA+EvTypeB+EvTypeC. So the execution plan for each file will be 
different.
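
To make the ordering requirement in 3.1 to 3.4 concrete, here is a minimal 
plain-Java sketch of what should happen to a single file. It is not Flink code 
and is only meant to illustrate the semantics we are after. The class name 
EventFileSketch, the Event fields, the handle method and the parallelByType 
map (standing in for our external configuration) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of steps 3.1-3.4 for ONE file; this is not Flink code.
class EventFileSketch {

    // Hypothetical event shape: a type name plus the event time carried in the event.
    static final class Event {
        final String type;
        final long eventTime;

        Event(String type, long eventTime) {
            this.type = type;
            this.eventTime = eventTime;
        }
    }

    // Predefined group order (3.2). Types absent from a file are skipped (3.5).
    static final List<String> TYPE_ORDER = Arrays.asList("EvTypeA", "EvTypeB", "EvTypeC");

    // Groups events by type (3.1) and returns the groups in the predefined order,
    // each group sorted by event time.
    static Map<String, List<Event>> plan(List<Event> fileEvents) {
        Map<String, List<Event>> byType =
                fileEvents.stream().collect(Collectors.groupingBy(e -> e.type));
        Map<String, List<Event>> ordered = new LinkedHashMap<>();
        for (String type : TYPE_ORDER) {
            List<Event> group = byType.get(type);
            if (group != null) {
                List<Event> sorted = new ArrayList<>(group);
                sorted.sort(Comparator.comparingLong((Event e) -> e.eventTime));
                ordered.put(type, sorted);
            }
        }
        return ordered;
    }

    // Processes the groups strictly one after another (3.4). Inside a group the
    // events run either sequentially by event time or in parallel (3.3).
    // Returns the group names in the order in which they were completed.
    static List<String> process(List<Event> fileEvents, Map<String, Boolean> parallelByType) {
        List<String> completedGroups = new ArrayList<>();
        for (Map.Entry<String, List<Event>> group : plan(fileEvents).entrySet()) {
            boolean parallel = parallelByType.getOrDefault(group.getKey(), false);
            if (parallel) {
                // forEach on a parallel stream returns only after every event is handled.
                group.getValue().parallelStream().forEach(EventFileSketch::handle);
            } else {
                group.getValue().forEach(EventFileSketch::handle);
            }
            completedGroups.add(group.getKey());
        }
        return completedGroups;
    }

    // Placeholder for the per-event-type Java class mentioned in 3.3.
    static void handle(Event e) {
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("EvTypeB", 5L), new Event("EvTypeA", 9L), new Event("EvTypeB", 2L));
        Map<String, Boolean> config = new HashMap<>();
        config.put("EvTypeB", true); // process EvTypeB events in parallel
        System.out.println(process(events, config));
    }
}
```

What we are looking for is the Flink equivalent of this, running for many 
files from many event sources at the same time.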

Questions:
1. Parallelism should be achieved across multiple event sources and also 
within each event type group. How can we achieve this kind of parallelism 
using Flink?
2. We want to record the progress of each file as it is processed, so that if 
a system failure occurs while processing a file, we can resume from where we 
left off. For example, suppose a file contains 1000 events (100 of EvTypeA, 
700 of EvTypeB and 200 of EvTypeC) and the system fails after processing all 
100 events of EvTypeA and 150 events of EvTypeB. The system should then resume 
processing from the 151st event of EvTypeB.
3. We also want to show the processing progress of each file on a dashboard 
that lists all files being processed and the progress of each one (e.g. 250 
out of 1000 events processed). Should we use Flink state management to keep 
the status of each processed event, or should we use our own state management, 
and why?
4. If the Flink task node that was processing an event crashes, will Flink 
route that task to another node automatically?
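
To make the resume requirement in question 2 concrete, the arithmetic for the 
example file can be sketched in plain Java as follows. This is not Flink code. 
The names ResumePointSketch and resumePoint and the "type#index" return format 
are hypothetical, and the sketch assumes progress is tracked as a single 
processed-event count per file, which only pinpoints an exact position when 
the events inside a group are processed sequentially.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of the resume point in question 2; this is not Flink code.
class ResumePointSketch {

    // countsInOrder: number of events per type, in the predefined processing order.
    // processed: number of events of the file that were already completed.
    // Returns "<type>#<1-based index of the next event in that group>",
    // or "DONE" when every event of the file has been processed.
    static String resumePoint(LinkedHashMap<String, Integer> countsInOrder, int processed) {
        int remaining = processed;
        for (Map.Entry<String, Integer> group : countsInOrder.entrySet()) {
            if (remaining < group.getValue()) {
                return group.getKey() + "#" + (remaining + 1);
            }
            remaining -= group.getValue();
        }
        return "DONE";
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Integer> counts = new LinkedHashMap<>();
        counts.put("EvTypeA", 100);
        counts.put("EvTypeB", 700);
        counts.put("EvTypeC", 200);
        // 100 EvTypeA + 150 EvTypeB events were processed before the failure.
        System.out.println(resumePoint(counts, 250)); // prints EvTypeB#151
    }
}
```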

A few lines of Flink code that show how to solve the above problems would be 
really helpful. Thank you for any help.

Search phrase: EventParallelism

- lgfmt
