Re: Better way to share large data across task managers

2020-09-25 Thread Kostas Kloudas
Hi Dongwon,

Yes, you are right that I assume that broadcasting occurs once. This is what I meant by "If you know the data in advance". Sorry for not being clear.

If you need to periodically broadcast new versions of the data, then I cannot find a better solution than the one you propose with the

Re: Better way to share large data across task managers

2020-09-23 Thread Dongwon Kim
Hi Kostas,

Thanks for the input! BTW, I guess you assume that the broadcasting occurs just once for bootstrapping, huh? My job needs not only bootstrapping but also periodic fetching of a new version of the data from some external storage.

Thanks,
Dongwon

> 2020. 9. 23. 4:59 AM, Kostas Kloudas

Re: Better way to share large data across task managers

2020-09-22 Thread Kostas Kloudas
Hi Dongwon,

If you know the data in advance, you can always use the YARN options in [1] (e.g. "yarn.ship-directories") to ship the directories containing the data you want only once to each YARN container (i.e. TM), and then write a UDF which reads them in the open() method. This will allow the data
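
To make the open() idea concrete, here is a minimal sketch in plain Java. It is not Flink API: `ShippedLookupFunction` is a hypothetical stand-in for a rich function, and the "key,value" line format and the directory passed to the constructor are assumptions made only for illustration. In a real job the same loading logic would presumably sit in the open() method of the UDF, reading from wherever the shipped directory ends up in the container.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java stand-in for a Flink rich function (hypothetical class, not Flink API).
final class ShippedLookupFunction {
    private final Path shippedDir;       // assumed location of the shipped directory
    private Map<String, String> lookup;  // filled in open(), only read afterwards

    ShippedLookupFunction(Path shippedDir) {
        this.shippedDir = shippedDir;
    }

    // Called once before any record is processed, like open() on a rich function.
    void open() {
        Map<String, String> data = new HashMap<>();
        try (Stream<Path> entries = Files.list(shippedDir)) {
            List<Path> files = entries.filter(Files::isRegularFile)
                                      .collect(Collectors.toList());
            for (Path file : files) {
                // Assumed format: one "key,value" pair per line; other lines are skipped.
                for (String line : Files.readAllLines(file)) {
                    int comma = line.indexOf(',');
                    if (comma > 0) {
                        data.put(line.substring(0, comma), line.substring(comma + 1));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        lookup = data;
    }

    // Per-record lookup; returns null when the key is not in the shipped data.
    String map(String key) {
        return lookup.get(key);
    }
}
```

The point of the sketch is only that the data is read from local disk once per function instance at startup, instead of being sent through the job as a broadcast stream.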

Better way to share large data across task managers

2020-09-20 Thread Dongwon Kim
Hi,

I'm using Flink broadcast state similar to what Fabian explained in [1]. One difference might be the size of the broadcast data; it is around 150 MB.

I've launched 32 TMs with the following settings:
- taskmanager.numberOfTaskSlots: 6
- parallelism of the non-broadcast side: 192

Here's some