Hi Flavio,

No, there's no way around it.
When a DataSet is consumed by more than one operator, those operators
cannot be chained to it.
The records need to be copied to avoid concurrent modifications. However,
the data is not shipped over the network if all operators have the same
parallelism.
Instead, records are serialized and handed over via local byte[] in-memory
channels.
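To illustrate the idea (this is not Flink's actual serialization stack, just a
plain-Java sketch using standard java.io serialization as a stand-in): because
each consumer deserializes its own copy from the shared bytes, one consumer
mutating its record cannot affect the other.

```java
import java.io.*;

public class LocalHandover {
    // A simple serializable record standing in for a DataSet element.
    static class Record implements Serializable {
        int id;
        Record(int id) { this.id = id; }
    }

    // Serialize a record to bytes (analogous to a local byte[] channel).
    static byte[] toBytes(Record r) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(r);
        }
        return bos.toByteArray();
    }

    // Deserialize: every call produces an independent copy.
    static Record fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Record) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Record original = new Record(42);
        byte[] channel = toBytes(original);
        // Two downstream "operators" each read their own copy.
        Record copy1 = fromBytes(channel);
        Record copy2 = fromBytes(channel);
        copy1.id = 99;                 // one consumer mutates its copy...
        System.out.println(copy2.id);  // ...the other copy is unaffected
    }
}
```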

Best, Fabian


2018-05-04 14:55 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> Flink 1.3.1 (I'm waiting 1.5 before upgrading..)
>
> On Fri, May 4, 2018 at 2:50 PM, Amit Jain <aj201...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> Which version of Flink are you using?
>>
>> --
>> Thanks,
>> Amit
>>
>> On Fri, May 4, 2018 at 6:14 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>> > Hi all,
>> > I have a Flink batch job that reads a Parquet dataset and then applies 2
>> > flatMaps to it (see pseudocode below).
>> > The problem is that this dataset is quite big, and Flink duplicates it
>> before
>> > sending the data to these 2 operators (I guessed this from the doubled
>> > amount of sent bytes).
>> > Is there a way to avoid this behaviour?
>> >
>> > -------------------------------------------------------
>> > Here's the pseudo code of my job:
>> >
>> > DataSet X = readParquetDir();
>> > X1 = X.flatMap(...);
>> > X2 = X.flatMap(...);
>> >
>> > Best,
>> > Flavio
>>
>
>
