Re: Fwd: Can `DataStream`s "fan-in" to a single sink?

2021-09-22 Thread Chesnay Schepler
> The main concern I have is that concurrent calls from the different 
pipelines might step on each other in some way


In your current solution this wouldn't happen because you'd actually end 
up with 3 separate sink operators.


> job1.union(job2).union(job3).addSink(sink)

This is the way to go.


On 23/09/2021 07:21, Antony Southworth wrote:

Hi

Firstly, apologies if I commit any faux-pas, I have never used a 
mailing list before. At least from Googling, reading Flink docs, and 
searching the mailing list archives for "fan-in" didn't turn up much 
so hoping someone can enlighten me here.


My use-case is similar to the following:

    DataStream source = env.addSource(new 
FlinkKinesisConsumer("my-kinesis-stream", ...));
   SinkFunction sink = new 
MyCustomSNSSinkFunction("my-sns-topic", ...);


    // "fan-out" to several paths
    DataStream job1 = source
    .assignTimestampsAndWatermarks(...)
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(...))
    .process(new Job1ProcessFunction());

    DataStream job2 = source
    .assignTimestampsAndWatermarks(...)
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(...))
    .process(new Job2ProcessFunction());

    DataStream job3 = source
    .assignTimestampsAndWatermarks(...)
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(...))
    .process(new Job3ProcessFunction());

    // part I am curious about
    // add the same sink object to each path
    job1.addSink(sink);
    job2.addSink(sink);
    job3.addSink(sink);

    env.execute();

Hopefully it's clear, the situation I describe (though I omit a lot of 
details so please tell me if it's not so, or if any additional details 
could help).


I couldn't find any examples in my Googling or reading the Flink docs 
of situations like this, where there are several processing pipelines 
(`job1`, `job2`, and `job3` in my example code) all feeding in to the 
same `SinkFunction` object. The API docs don't really mention anything 
about this case either (again, unless I missed it, in which case 
please point it out to me :) ).


The main concern I have is that concurrent calls from the different 
pipelines might step on each other in some way; can anyone confirm or 
deny that the concern is valid? Should I be using `DataStream.union` 
first? E.g. like `job1.union(job2).union(job3).addSink(sink)`?


Appreciate any advice people have :)

---

*Antony Southworth*
Data Engineer - Halter Limited

/This message (including any attachments) may contain confidential, 
proprietary, privileged and/or private information. The information is 
intended to be for the use of the individual or entity designated 
above. If you are not the intended recipient of this message, please 
notify the sender immediately, and delete the message and any 
attachments. Any disclosure, reproduction, distribution or other use 
of this message or any attachments by an individual or entity other 
than the intended recipient is prohibited. / 





Fwd: Can `DataStream`s "fan-in" to a single sink?

2021-09-22 Thread Antony Southworth
Hi

Firstly, apologies if I commit any faux-pas, I have never used a mailing
list before. At least from Googling, reading Flink docs, and searching the
mailing list archives for "fan-in" didn't turn up much so hoping someone
can enlighten me here.

My use-case is similar to the following:

DataStream source = env.addSource(new
FlinkKinesisConsumer("my-kinesis-stream", ...));
   SinkFunction sink = new MyCustomSNSSinkFunction("my-sns-topic",
...);

// "fan-out" to several paths
DataStream job1 = source
.assignTimestampsAndWatermarks(...)
.keyBy(...)
.window(TumblingEventTimeWindows.of(...))
.process(new Job1ProcessFunction());

DataStream job2 = source
.assignTimestampsAndWatermarks(...)
.keyBy(...)
.window(TumblingEventTimeWindows.of(...))
.process(new Job2ProcessFunction());

DataStream job3 = source
.assignTimestampsAndWatermarks(...)
.keyBy(...)
.window(TumblingEventTimeWindows.of(...))
.process(new Job3ProcessFunction());

// part I am curious about
// add the same sink object to each path
job1.addSink(sink);
job2.addSink(sink);
job3.addSink(sink);

env.execute();

Hopefully it's clear, the situation I describe (though I omit a lot of
details so please tell me if it's not so, or if any additional details
could help).

I couldn't find any examples in my Googling or reading the Flink docs of
situations like this, where there are several processing pipelines (`job1`,
`job2`, and `job3` in my example code) all feeding in to the same
`SinkFunction` object. The API docs don't really mention anything about
this case either (again, unless I missed it, in which case please point it
out to me :) ).

The main concern I have is that concurrent calls from the different
pipelines might step on each other in some way; can anyone confirm or deny
that the concern is valid? Should I be using `DataStream.union` first? E.g.
like `job1.union(job2).union(job3).addSink(sink)`?

Appreciate any advice people have :)

---

*Antony Southworth*
Data Engineer - Halter Limited

-- 
_This message (including any attachments) may contain confidential, 
proprietary, privileged and/or private information. The information is 
intended to be for the use of the individual or entity designated above. If 
you are not the intended recipient of this message, please notify the 
sender immediately, and delete the message and any attachments. Any 
disclosure, reproduction, distribution or other use of this message or any 
attachments by an individual or entity other than the intended recipient is 
prohibited. _