Could you please point me to any documentation on the "credit-based flow control" approach....
On Tue, Jan 2, 2018 at 10:35 AM, Timo Walther <twal...@apache.org> wrote: > Hi Vishal, > > your assumptions sound reasonable to me. The community is currently > working on a more fine-grained back pressuring with credit-based flow > control. It is on the roamap for 1.5 [1]/[2]. I will loop in Nico that > might tell you more about the details. Until then I guess you have to > implement a custom source/adapt an existing source to let the data flow in > more realistic. > > Regards, > Timo > > [1] http://flink.apache.org/news/2017/11/22/release-1.4-and-1.5- > timeline.html > [2] https://www.youtube.com/watch?v=scStdhz9FHc > > > Am 1/2/18 um 4:02 PM schrieb Vishal Santoshi: > > I did a simulation on session windows ( in 2 modes ) and let it rip for >> about 12 hours >> >> 1. Replay where a kafka topic with retention of 7 days was the source ( >> earliest ) >> 2. Start the pipe with kafka source ( latest ) >> >> I saw results that differed dramatically. >> >> On replay the pipeline stalled after good ramp up while in the second >> case the pipeline hummed on without issues. For the same time period the >> data consumed is significantly more in the second case with the WM >> progression stalled in the first case with no hint of resolution ( the >> incoming data on source topic far outstrips the WM progression ) I think I >> know the reasons and this is my hypothesis. >> >> In replay mode the number of windows open do not have an upper bound. >> While buffer exhaustion ( and data in flight with watermark ) is the >> reason for throttle, it does not really limit the open windows and in fact >> creates windows that reflect futuristic data ( future is relative to the >> current WM ) . So if partition x has data for watermark time t(x) and >> partition y for watermark time t(y) and t(x) << t(y) where the overall >> watermark is t(x) nothing significantly throttles consumption from the y >> partition ( in fact for x too ) , the bounded buffer based approach does >> not give minute control AFAIK as one would hope and that implies there are >> far more open windows than the system can handle and that leads to the >> pathological case where the buffers fill up ( I believe that happens way >> late ) and throttling occurs but the WM does not proceed and windows that >> could ease the glut the throttling cannot proceed..... In the replay mode >> the amount of data implies that the Fetchers keep pulling data at the >> maximum consumption allowed by the open ended buffer approach. >> >> My question thus is, is there any way to have a finer control of back >> pressure, where in the consumption from a source is throttled preemptively >> ( by for example decreasing the buffers associated for a pipe or the size >> allocated ) or sleeps in the Fetcher code that can help aligning the >> performance to have real time consumption characteristics >> >> Regards, >> >> Vishal. >> >> >> >> >> >