Hi Averell,

The records emitted by the monitoring tasks are "just" file splits, i.e.,
meta information that defines which data to read from where.
The reader tasks receive these splits and process them by reading the
corresponding files.

You could of course partition the splits based on the file name (or any
other attribute); however, that is not the only thing you need to change
if you want a fault-tolerant setup.
A reader task stores the splits that it hasn't processed yet in operator
state, which is randomly redistributed when the operator recovers from a
failure (or when rescaling the application).
You would also need to change the logic of the reader task to ensure
that the splits are deterministically assigned to reader tasks.
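
For illustration, a minimal sketch of such a deterministic assignment (the
helper name is mine, and I'm assuming you route by the split's file path):

    import org.apache.flink.core.fs.FileInputSplit

    // Hypothetical helper: map a split to a fixed reader subtask by
    // hashing its file path, instead of relying on random redistribution.
    def targetSubtask(split: FileInputSplit, parallelism: Int): Int =
      Math.floorMod(split.getPath.toString.hashCode, parallelism)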

TBH, I would just add a keyBy() after the source. Since the monitoring
task just emits metadata (the splits), the actual data won't be shuffled twice.
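
For reference, here is a minimal sketch of that approach (the input format,
path, watch interval, and key extraction are assumptions):

    import org.apache.flink.api.java.io.TextInputFormat
    import org.apache.flink.core.fs.Path
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val path = "hdfs:///data/in"  // assumed input directory
    val format = new TextInputFormat(new Path(path))

    // One monitoring task emits the splits; the parallel reader tasks
    // read the files and emit the actual records.
    val lines: DataStream[String] =
      env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60 * 1000L)

    // keyBy after the source: only the records are shuffled, and only once.
    val keyed = lines.keyBy(line => line.split(',')(0))  // assumed key field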

Best, Fabian

2018-07-31 6:54 GMT+02:00 vino yang <yanghua1...@gmail.com>:

> Hi Averell,
>
> Actually, performing a key partition inside the source function is the
> same as DataStream[Source].keyBy(custom partitioner), because keyBy is
> not a real operator, but a virtual node in the DAG that does not correspond
> to a physical operator.
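>
> To illustrate (a sketch, assuming the same imports and the `lines` stream
> from the sketch above): keyBy only annotates the edge to the next operator
> with hash partitioning; the physical work happens in the operators around it.
>
>     // keyBy creates no task of its own; it only determines how records
>     // are routed from the source/reader tasks to the downstream tasks.
>     val counts = lines
>       .keyBy(line => line.split(',')(0))  // virtual: partitioning of the edge
>       .map(line => (line, 1))             // physical: runs in the downstream tasks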
>
> Thanks, vino.
>
> 2018-07-31 10:52 GMT+08:00 Averell <lvhu...@gmail.com>:
>
>> Hi Vino,
>>
>> I'm a little bit confused.
>> If I can do the partitioning from within the source function, using the
>> same hash function on the key to identify the partition, would that be
>> sufficient to avoid shuffling in the next keyBy call?
>>
>> Thanks.
>> Averell
>>
>>
>>
>>
>
>
