Thanks Fabian, that's clearer. Often you don't know whether or not to
rebalance a dataset, because it depends on the specific use case and
dataset distribution.
An automatic way of deciding whether a DataSet could benefit from a
rebalance would be VERY nice (at least for batch), but I fear this
would be very hard to implement. Am I wrong?
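
To make the skew from the quoted reply below concrete, here is a toy
simulation in plain Python (not Flink code). The function names and the
"head start" parameter are made up for illustration; the only assumption
is that splits are handed out on request and that a rebalance
redistributes records round-robin, which is what DataSet.rebalance() is
meant to do.

```python
from collections import deque


def assign_splits_lazily(num_splits, num_tasks, first_task_head_start):
    """Toy model of lazy split assignment: tasks pull splits from a shared queue."""
    splits = deque(range(num_splits))
    assigned = [[] for _ in range(num_tasks)]
    # Task 0 requests `first_task_head_start` splits before any other task
    # gets to make its first request (the "race" with very small splits).
    for _ in range(min(first_task_head_start, num_splits)):
        assigned[0].append(splits.popleft())
    # Whatever is left is handed out one request at a time, tasks taking turns.
    task = 0
    while splits:
        assigned[task].append(splits.popleft())
        task = (task + 1) % num_tasks
    return assigned


def rebalance(partitions):
    """Round-robin redistribution of all records over the same number of partitions."""
    out = [[] for _ in partitions]
    i = 0
    for part in partitions:
        for record in part:
            out[i % len(out)].append(record)
            i += 1
    return out


skewed = assign_splits_lazily(num_splits=8, num_tasks=4, first_task_head_start=8)
balanced = rebalance(skewed)
print([len(p) for p in skewed])    # [8, 0, 0, 0]
print([len(p) for p in balanced])  # [2, 2, 2, 2]
```

Nothing fails in the skewed case; one task simply ends up with all the
work, which only matters when each record is expensive to process.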

On Mon, Apr 29, 2019 at 3:10 PM Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Flavio,
>
> These types of race conditions are not failure cases, so no exception is
> thrown.
> It only means that a single source task reads all (or most of the) splits
> and no splits are left for the other tasks.
> This can be a problem if a record represents a large amount of IO or an
> intensive computation, because the records might not be evenly distributed.
> In that case you'd need to manually rebalance the partitions of a DataSet.
>
> Fabian
>
> On Mon, Apr 29, 2019 at 2:42 PM Flavio Pompermaier <
> pomperma...@okkam.it> wrote:
>
>> Hi Fabian, I wasn't aware that "race-conditions may happen if your
>> splits are very small as the first data source task might rapidly request
>> and process all splits before the other source tasks do their first
>> request". What exactly happens when a race condition arises? Is this
>> exception handled internally by Flink or not?
>>
>> On Mon, Apr 29, 2019 at 11:51 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The method that I described in the SO answer is still implemented in
>>> Flink.
>>> Flink tries to assign splits to tasks that run on local TMs.
>>> However, files are not split per line (this would be horribly
>>> inefficient) but into larger chunks depending on the number of subtasks
>>> (and, in the case of HDFS, the file block size).
>>>
>>> Best, Fabian
>>>
>>> On Sun, Apr 28, 2019 at 6:48 PM Soheil Pourbafrani <
>>> soheil.i...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to know exactly how Flink reads data in both cases: a file on the
>>>> local filesystem and a file on a distributed file system.
>>>>
>>>> When reading data from the local file system, I guess every line of the
>>>> file will be read by a slot (according to the job parallelism) to apply
>>>> the map logic.
>>>>
>>>> For reading from HDFS, I read this
>>>> <https://stackoverflow.com/a/39153402/8110607> answer by Fabian Hueske
>>>> <https://stackoverflow.com/users/3609571/fabian-hueske> and I want to
>>>> know whether that is still Flink's strategy for reading from a
>>>> distributed file system.
>>>>
>>>> thanks
>>>>
>>>
>>
>>
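
The chunking described in the thread above (splits sized by the number
of subtasks and, on HDFS, capped by the block size) can be sketched in
plain Python. This is a simplified model under stated assumptions, not
Flink's actual implementation: the function name and the exact sizing
rule are illustrative only.

```python
def compute_splits(file_size, num_subtasks, block_size=None):
    """Return (offset, length) byte ranges under a simplified splitting rule."""
    # Aim for at least one split per subtask (ceiling division).
    target = -(-file_size // num_subtasks)
    # Assumption: on a block-based file system a split never exceeds one block.
    split_size = min(target, block_size) if block_size else target
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# Local file: 1000 bytes, 4 subtasks -> 4 splits of 250 bytes each.
local_splits = compute_splits(1000, 4)
print(local_splits)

# Block-based file system: 1000 bytes, 2 subtasks, 128-byte blocks
# -> 8 splits, the last one covering the remaining 104 bytes.
hdfs_splits = compute_splits(1000, 2, block_size=128)
print(len(hdfs_splits), hdfs_splits[-1])
```

In this model there are more splits than subtasks whenever the block
size is smaller than the per-subtask share, which is where assigning
splits on request comes into play.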
