Hi Flavio,

These types of race conditions are not failure cases, so no exception is thrown. It only means that a single source task reads all (or most of) the splits and no splits are left for the other tasks. This can be a problem if each record represents a large amount of IO or an intensive computation, because the work might not be evenly distributed. In that case you'd need to manually rebalance the partitions of the DataSet.
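For example, something along these lines should work (just an untested sketch with the Java DataSet API; the paths and the expensive map function are placeholders):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class RebalanceExample {
      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the input; with many small splits a single source task may
        // end up reading most of them.
        DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");

        // rebalance() redistributes the records round-robin across all parallel
        // subtasks, so the expensive per-record work is spread evenly even if
        // the splits were not.
        lines.rebalance()
             .map(new MapFunction<String, String>() {
               @Override
               public String map(String line) {
                 // Placeholder for an IO- or CPU-intensive per-record computation.
                 return line.toUpperCase();
               }
             })
             .writeAsText("hdfs:///path/to/output");

        env.execute("Rebalance example");
      }
    }

Note that rebalance() introduces a network shuffle, so it only pays off if the per-record work is clearly more expensive than the extra data transfer.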
Fabian

On Mon, Apr 29, 2019 at 2:42 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Hi Fabian,
> I wasn't aware that "race-conditions may happen if your splits are very
> small as the first data source task might rapidly request and process all
> splits before the other source tasks do their first request".
> What happens exactly when a race condition arises? Is this exception
> internally handled by Flink or not?
>
> On Mon, Apr 29, 2019 at 11:51 AM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi,
>>
>> The method that I described in the SO answer is still implemented in
>> Flink.
>> Flink tries to assign splits to tasks that run on local TMs.
>> However, files are not split per line (this would be horribly
>> inefficient) but in larger chunks depending on the number of subtasks
>> (and, in the case of HDFS, the file block size).
>>
>> Best, Fabian
>>
>> On Sun, Apr 28, 2019 at 6:48 PM Soheil Pourbafrani <soheil.i...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I want to know exactly how Flink reads data, both for a file in the
>>> local filesystem and a file on a distributed file system.
>>>
>>> When reading from the local file system, I guess every line of the file
>>> will be read by a slot (according to the job parallelism) for applying
>>> the map logic.
>>>
>>> For reading from HDFS I read this
>>> <https://stackoverflow.com/a/39153402/8110607> answer by Fabian Hueske
>>> <https://stackoverflow.com/users/3609571/fabian-hueske> and I want to
>>> know whether that is still the Flink strategy for reading from a
>>> distributed file system.
>>>
>>> Thanks
>>>
>>
>