How to Fan Out to 100s of Sinks

2021-11-28 Thread SHREEKANT ANKALA
Hi all, we currently have a Flink job that retrieves jsonl data from GCS and
writes it to Iceberg tables. We are using Flink 1.13.2 and things are working fine.

We now have to fan out that same data into 100 different sinks: Iceberg
tables on S3. There will be 100 buckets, and the data needs to be sent to each
of these 100 different buckets.

We are planning to add a new job that writes to one sink per launch, running it
once for each sink. Is there a more optimal approach in Flink to support this
use case of 100 different sinks?


Re: How to Fan Out to 100s of Sinks

2021-11-29 Thread Fabian Paul
Hi,

What do you mean by "fan out" to 100 different sinks? Do you want to
replicate the data in all buckets or is there some conditional
branching logic?

In general, Flink can easily support 100 different sinks, but I am not
sure whether this is the right approach for your use case. Can you clarify
your motivation and tell us a bit more about the exact scenario?

Best,
Fabian





Re: How to Fan Out to 100s of Sinks

2021-11-29 Thread SHREEKANT ANKALA
Hi,
Here is our scenario:

We have a system that generates data in a jsonl file for all customers
together. We now need to process this jsonl data and conditionally distribute
it to individual customers, based on their preferences, as Iceberg tables.
For every line in the jsonl file, the data will end up in one of the customers'
S3 buckets as an Iceberg table row. We were hoping to keep using Flink for this
use case with just one job doing a conditional sink, but we are not sure if that
would be the right usage of Flink.
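To make the conditional distribution concrete, here is a minimal plain-Python sketch of the routing step (not Flink code; the `customer_id` field, the preference table, and the bucket names are all hypothetical stand-ins for however your records identify their owner):

```python
import json
from collections import defaultdict

# Hypothetical preference table: maps each customer to its target bucket.
# In the real system this would come from your customer-preference store.
PREFERENCES = {
    "acme": {"bucket": "s3://customer-acme/iceberg"},
    "globex": {"bucket": "s3://customer-globex/iceberg"},
}

def route(jsonl_lines):
    """Group each jsonl record by the bucket of the customer it belongs to."""
    per_bucket = defaultdict(list)
    for line in jsonl_lines:
        record = json.loads(line)
        customer = record["customer_id"]  # hypothetical routing key
        bucket = PREFERENCES[customer]["bucket"]
        per_bucket[bucket].append(record)
    return per_bucket

lines = [
    '{"customer_id": "acme", "value": 1}',
    '{"customer_id": "globex", "value": 2}',
    '{"customer_id": "acme", "value": 3}',
]
routed = route(lines)
```

In the Flink job this grouping decision is what the operator in front of the sinks would make per record, rather than in one batch pass.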

Thanks,
Shree



Re: How to Fan Out to 100s of Sinks

2021-11-30 Thread Fabian Paul
Hi Shree,

I think for every Iceberg table you have to instantiate a separate
sink in your program. You basically have one operator before your
sinks that decides where to route the records, so you end up
with one Iceberg sink for each of your customers. Maybe you can take a
look at the DemultiplexingSink [1], but unfortunately there has not
been much progress on it yet.

Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-24493
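The "one router in front of N pre-instantiated sinks" shape can be sketched in plain Python (not the Flink API; in a real job the router would be something like a ProcessFunction emitting side outputs and each sink an Iceberg Flink sink; class and field names here are hypothetical):

```python
class CollectingSink:
    """Stand-in for one per-customer Iceberg sink."""
    def __init__(self, bucket):
        self.bucket = bucket
        self.rows = []

    def write(self, record):
        self.rows.append(record)

def build_topology(customers):
    # One sink instance per customer, created up front when the job is built,
    # mirroring how each Iceberg sink must exist in the job graph.
    return {c: CollectingSink(f"s3://{c}/iceberg") for c in customers}

def run(records, sinks):
    # The single routing operator: decide the target customer, then hand
    # the record to exactly one of the pre-instantiated sinks.
    for record in records:
        sinks[record["customer_id"]].write(record)

sinks = build_topology(["acme", "globex"])
run(
    [{"customer_id": "acme", "v": 1}, {"customer_id": "globex", "v": 2}],
    sinks,
)
```

The key point the sketch illustrates is that the set of sinks is fixed at job-build time, while the routing decision happens per record at runtime.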
