Hi Mark,

IMHO, your design of the Flink application is generally feasible. In
Flink ML, I once encountered a similar design in the ChiSqTest operator,
where the input data is first aggregated to produce some results, which
are then broadcast and connected with other result streams derived from
the same input. You may refer to this algorithm for more details when
designing your application.
https://github.com/apache/flink-ml/blob/master/flink-ml-lib/src/main/java/org/apache/flink/ml/stats/chisqtest/ChiSqTest.java
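
In case it helps, here is a rough, minimal sketch of that "aggregate, then
broadcast and connect" pattern using the DataStream API's broadcast state.
Note that the KeyedResult type, its key/score fields, the "global" state key,
and the comparison logic are placeholders I made up for illustration; they
are not taken from ChiSqTest or from your design:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

// globalResult: the single non-keyed aggregate computed from the input.
// keyedResults: the per-key aggregates computed from the same input.
MapStateDescriptor<String, Double> globalDesc =
        new MapStateDescriptor<>("globalResult", Types.STRING, Types.DOUBLE);

BroadcastStream<Double> broadcastGlobal = globalResult.broadcast(globalDesc);

DataStream<KeyedResult> survivors = keyedResults
        .keyBy(r -> r.key)
        .connect(broadcastGlobal)
        .process(new KeyedBroadcastProcessFunction<String, KeyedResult, Double, KeyedResult>() {
            @Override
            public void processElement(KeyedResult value, ReadOnlyContext ctx,
                                       Collector<KeyedResult> out) throws Exception {
                // Compare this key's result against the broadcast global result
                // (placeholder test; a real job would also need to buffer keyed
                // elements that arrive before the broadcast element).
                Double global = ctx.getBroadcastState(globalDesc).get("global");
                if (global != null && value.score > global) {
                    out.collect(value);
                }
            }

            @Override
            public void processBroadcastElement(Double value, Context ctx,
                                                Collector<KeyedResult> out) throws Exception {
                // Store the latest global result in broadcast state.
                ctx.getBroadcastState(globalDesc).put("global", value);
            }
        });

The surviving records can then be keyed by the STAGE-2 dimensions and fed
directly into the next set of windows.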

Besides, side outputs are typically used when you want to split an
output stream into different categories. Given that the
ProcessWindowFn before each SideOutput-x only has one downstream
consumer, it would be enough to pass the resulting DataStream directly
to the session windows instead of introducing side outputs.
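
For illustration, a minimal sketch of that direct chaining, assuming event
time and made-up names (Result, MyStageOneWindowFn, MyStageOneEvalFn,
kafkaSink, the key selectors, and the window sizes/gaps are all placeholders,
not from your design):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Output of the keyed one-minute tumbling windows, consumed directly by the
// session window instead of going through a side output.
DataStream<Result> stageOneResults = input
        .keyBy(r -> r.key)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .process(new MyStageOneWindowFn());

stageOneResults
        .keyBy(r -> r.groupKey)
        .window(EventTimeSessionWindows.withGap(Time.minutes(1)))
        .process(new MyStageOneEvalFn())
        .sinkTo(kafkaSink);

Side outputs remain useful where a single function has to emit more than one
differently-typed stream, as in the first sentence above.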

Best,
Yunfeng

On Sun, Apr 7, 2024 at 12:41 AM Mark Petronic <markpetro...@gmail.com> wrote:
>
> I am looking for some design advice for a new Flink application, as I am 
> relatively new to Flink - I have one fairly straightforward Flink application 
> in production so far.
>
> For this new application, I want to create a three-stage processing pipeline. 
> Functionally, I see this as ONE long datastream. But I have to evaluate the 
> STAGE-1 data in a special manner and then pass that evaluation on to STAGE-2, 
> which does its own special evaluation, using the STAGE-1 results to shape it. 
> The same thing happens again in STAGE-3, using the STAGE-2 evaluation results. 
> Finally, the end result is published to Kafka. The stages functionally look 
> like this:
>
> STAGE-1
> KafkaSource |=> Keyby => TumblingWindows1 => ProcessWindowFn => SideOutput-1 |=> SessionWindow1 => ProcessWindowFn => (SideOutput-2[WindowRecords], KafkaSink[EvalResult])
>             |=================> WindowAll => ProcessWindowFn => SideOutput-1 ^
>
> STAGE-2
> SideOutput-2 => Keyby => TumblingWindows2 => ProcessWindowFn => SideOutput-3 
> => SessionWindow2 => ProcessWindowFn => (SideOutput-4[WindowRecords], 
> KafkaSink[EvalResult])
>
> STAGE-3
> SideOutput-4 => Keyby => TumblingWindows3 => ProcessWindowFn => SideOutput-5 
> => SessionWindow3 => ProcessWindowFn => KafkaSink
>
> DESCRIPTION
>
> In STAGE-1, there are a fixed number of known keys, so I will only see at most 
> about 21 distinct keys and therefore up to 21 tumbling one-minute windows. I 
> also need to aggregate all data in a global window to get an overall 
> non-keyed result. I need to bring the 21 results from those 21 tumbling 
> windows AND the one global result into one place where I can compare each of 
> the 21 windows' results to the one global result. Based on this evaluation, 
> only some of the 21 windows' results will survive that test. I then want to 
> take the data records from those surviving windows (say, 3 of them) and make 
> them the "source" for STAGE-2 processing, as well as publish some intermediate 
> evaluation results to a KafkaSink. STAGE-2 will reprocess the same data 
> records that the three STAGE-1 surviving windows processed, only keying them 
> by different dimensions. I expect around 4000 fairly small records in each of 
> the 21 STAGE-1 windows, so, in this example, I would be sending 4000 x 3 = 
> 12000 records in SideOutput-2 to form the new "source" datastream for STAGE-2.
>
> Where I am struggling is:
>
> - Figuring out how to best connect the output of the 21 STAGE-1 windows and 
>   the one WindowAll window's records into a single point (I propose 
>   SessionWindow1), to be able to compare each of the 21 windows' results with 
>   the WindowAll non-keyed results.
> - The best way to connect these multiple stages together.
>
> Looking at the STAGE-1 flow illustrated above, this is my attempt at an 
> approach using side outputs to:
>
> - Form a new "source" data stream that contains the outputs of each of the 21 
>   windows and the WindowAll data
> - Consume that into a single session window
> - Do the evaluations of the 21 keyed windows' results against the overall 
>   WindowAll data
> - Emit only the 3 surviving sets of data from the 3 tumbling windows' outputs 
>   from the ProcessWindowFn to SideOutput-2, and the evaluation results to 
>   Kafka
>
> Finally, SideOutput-2 will form the new data stream "source" for STAGE-2, 
> where a similar process will repeat, passing data to STAGE-3 (again with 
> similar processing) to finally obtain the desired result that will be 
> published to Kafka.
>
> I would greatly appreciate the following:
>
> - Comments on whether this is a valid approach - am I on the right track here?
> - If this is problematic, could you suggest an alternate approach that I could 
>   investigate?
>
> I am trying to build a Flink application that follows best practices, so I am 
> just looking for some confirmation that I am heading down a reasonable path 
> with this design.
>
> Thank you in advance,
> Mark
>
