Thanks Fabian, that's clearer now. Often you can't tell whether a DataSet should be rebalanced, because it depends on the specific use case and on how the data is distributed. An automatic way of deciding whether a DataSet would benefit from a rebalance would be VERY nice (at least for batch), but I fear it would be very hard to implement. Am I wrong?
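
To make the idea concrete, here is a rough sketch in plain Java (not Flink code) of the kind of check I have in mind. The class name RebalanceSketch, the skew measure (largest partition divided by the mean partition size) and the 1.5 threshold are all made up for illustration; none of this is an existing Flink feature. The round-robin redistribution only imitates what a manual rebalance does conceptually; in a real job that step would be the rebalance() call on the DataSet.

```java
import java.util.ArrayList;
import java.util.List;

public class RebalanceSketch {

    // Hypothetical threshold: rebalance only if the largest partition is
    // more than 1.5x the mean partition size.
    public static final double SKEW_THRESHOLD = 1.5;

    /** Largest partition size divided by the mean size; 1.0 means perfectly even. */
    public static double skew(List<List<String>> partitions) {
        int total = 0;
        int max = 0;
        for (List<String> partition : partitions) {
            total += partition.size();
            max = Math.max(max, partition.size());
        }
        if (total == 0) {
            return 1.0; // nothing to distribute, treat as balanced
        }
        double mean = (double) total / partitions.size();
        return max / mean;
    }

    /** Redistributes all records round-robin over the same number of partitions. */
    public static List<List<String>> rebalance(List<List<String>> partitions) {
        int n = partitions.size();
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new ArrayList<>());
        }
        int next = 0;
        for (List<String> partition : partitions) {
            for (String record : partition) {
                out.get(next).add(record);
                next = (next + 1) % n;
            }
        }
        return out;
    }

    /** Rebalances only when the skew exceeds the (assumed) threshold. */
    public static List<List<String>> rebalanceIfSkewed(List<List<String>> partitions) {
        return skew(partitions) > SKEW_THRESHOLD ? rebalance(partitions) : partitions;
    }

    public static void main(String[] args) {
        // Simulate the race Fabian describes: with 4 source tasks, the first
        // task grabbed all 8 splits and the other three got nothing.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < 8; i++) {
            partitions.get(0).add("record-" + i);
        }

        System.out.println("skew before: " + skew(partitions)); // 4.0
        List<List<String>> balanced = rebalanceIfSkewed(partitions);
        System.out.println("skew after:  " + skew(balanced));   // 1.0
    }
}
```

The catch, as far as I can see, is that the partition sizes (and the cost per record) are not known before the job actually runs, which is why I suspect an automatic decision is hard.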

On Mon, Apr 29, 2019 at 3:10 PM Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Flavio,
>
> These types of race conditions are not failure cases, so no exception is
> thrown.
> It only means that a single source task reads all (or most of the) splits
> and no splits are left for the other tasks.
> This can be a problem if a record represents a large amount of IO or an
> intensive computation, as they might not be properly distributed. In that
> case you'd need to manually rebalance the partitions of a DataSet.
>
> Fabian
>
> On Mon, Apr 29, 2019 at 2:42 PM Flavio Pompermaier
> <pomperma...@okkam.it> wrote:
>
>> Hi Fabian, I wasn't aware that "race-conditions may happen if your
>> splits are very small as the first data source task might rapidly request
>> and process all splits before the other source tasks do their first
>> request". What exactly happens when a race condition arises? Is this
>> exception handled internally by Flink or not?
>>
>> On Mon, Apr 29, 2019 at 11:51 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The method that I described in the SO answer is still implemented in
>>> Flink.
>>> Flink tries to assign splits to tasks that run on local TMs.
>>> However, files are not split per line (this would be horribly
>>> inefficient) but in larger chunks depending on the number of subtasks
>>> (and, in the case of HDFS, the file block size).
>>>
>>> Best, Fabian
>>>
>>> On Sun, Apr 28, 2019 at 6:48 PM Soheil Pourbafrani
>>> <soheil.i...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to know exactly how Flink reads data in both cases: a file on
>>>> the local filesystem and a file on a distributed file system.
>>>>
>>>> When reading data from the local file system, I guess every line of
>>>> the file will be read by a slot (according to the job parallelism)
>>>> for applying the map logic.
>>>>
>>>> For reading from HDFS, I read this
>>>> <https://stackoverflow.com/a/39153402/8110607> answer by Fabian Hueske
>>>> <https://stackoverflow.com/users/3609571/fabian-hueske> and I want to
>>>> know whether that is still the Flink strategy for reading from a
>>>> distributed file system.
>>>>
>>>> thanks