I ran a simulation on session windows (in two modes) and let it run for
about 12 hours:

1. Replay, where a Kafka topic with a retention of 7 days was the source
(starting from earliest)
2. Starting the pipeline with the Kafka source at latest

I saw results that differed dramatically.

On replay the pipeline stalled after a good ramp-up, while in the second case
the pipeline hummed along without issues. For the same time period,
significantly more data was consumed in the second case. In the first case
the watermark (WM) progression stalled with no hint of resolution (the
incoming data on the source topic far outstrips the WM progression). I think
I know the reasons, and this is my hypothesis.

In replay mode the number of open windows has no upper bound. While
buffer exhaustion (and data in flight with the watermark) is what triggers
throttling, it does not really limit the open windows, and in fact it
creates windows that reflect futuristic data (future relative to the
current WM). So if partition x has data for watermark time t(x) and
partition y for watermark time t(y), with t(x) << t(y) and the overall
watermark at t(x), nothing significantly throttles consumption from
partition y (or, in fact, from x either). AFAIK the bounded-buffer approach
does not give the fine-grained control one would hope for, which implies
there are far more open windows than the system can handle. That leads to
the pathological case where the buffers fill up (I believe that happens very
late) and throttling occurs, but the WM does not advance, so the windows
that could ease the glut cannot make progress either. In replay mode the
amount of data means that the Fetchers keep pulling data at the maximum
rate allowed by the open-ended buffer approach.
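To make the hypothesis concrete, here is a toy model in plain Python (not Flink code). It assumes every partition is read at the same rate, takes the overall watermark as the minimum across partitions, and treats a window as closable only once the watermark passes its end. The partition names, the fixed window size and the one-event-per-step read rate are all assumptions for illustration. It only counts open windows and does not model buffers or the stall itself.

```python
def simulate(start_ts, steps, window_size=10):
    """Toy model: every partition is read at the same rate (one event per step).

    start_ts maps a hypothetical partition name to the event time of its
    first record. Returns (overall_watermark, number_of_open_windows).
    """
    current = dict(start_ts)            # next event time per partition
    open_windows = set()                # (partition, window_index) not yet closed
    watermark = min(current.values())
    for _ in range(steps):
        for p in current:
            open_windows.add((p, current[p] // window_size))
            current[p] += 1
        # Overall watermark is the minimum across partitions.
        watermark = min(current.values())
        # A window can close only once the watermark passes its end.
        open_windows = {(p, w) for (p, w) in open_windows
                        if (w + 1) * window_size > watermark}
    return watermark, len(open_windows)


replay = simulate({"x": 0, "y": 1000}, steps=1005)   # t(x) << t(y)
live = simulate({"x": 0, "y": 0}, steps=1005)        # partitions in step
print("replay:", replay, "live:", live)
```

In this model the watermark ends up the same in both runs, but the skewed (replay-like) run holds on to every window from partition y that lies ahead of t(x), while the aligned (live-like) run keeps only the windows currently being filled.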

My question, then: is there any way to get finer control of back
pressure, wherein consumption from a source is throttled preemptively
(for example, by decreasing the buffers associated with a pipe or their
allocated size), or to add sleeps in the Fetcher code, so that performance
aligns with real-time consumption characteristics?
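As a sketch of the kind of preemptive throttling I have in mind, here is a toy read loop in plain Python (again, not Flink code and not an existing Flink option). The idea is to skip reading from any partition whose next event time is more than `max_lead` ahead of the overall watermark. The function name, `max_lead`, the partition names and the one-event-per-step read rate are all hypothetical.

```python
def throttled_read(start_ts, steps, max_lead, window_size=10):
    """Toy fetch loop that pauses partitions running too far ahead.

    start_ts maps a hypothetical partition name to its first event time.
    Returns (overall_watermark, peak_number_of_open_windows).
    """
    current = dict(start_ts)            # next event time per partition
    open_windows = set()
    peak_open = 0
    for _ in range(steps):
        watermark = min(current.values())
        for p in current:
            # Preemptive throttle: do not read a partition that is more
            # than max_lead ahead of the overall watermark.
            if current[p] - watermark > max_lead:
                continue
            open_windows.add((p, current[p] // window_size))
            current[p] += 1
        watermark = min(current.values())
        open_windows = {(p, w) for (p, w) in open_windows
                        if (w + 1) * window_size > watermark}
        peak_open = max(peak_open, len(open_windows))
    return min(current.values()), peak_open


skew = {"x": 0, "y": 1000}
unthrottled = throttled_read(skew, steps=2000, max_lead=float("inf"))
throttled = throttled_read(skew, steps=2000, max_lead=50)
print("unthrottled:", unthrottled, "throttled:", throttled)
```

In this model the slowest partition is never paused, so the watermark advances just as far in both runs, while the peak number of open windows is capped by `max_lead` rather than by the skew between partitions. With a real Kafka consumer, `pause()` and `resume()` on individual partitions could play the role of the `continue` here; whether the Fetcher exposes a hook for that is exactly what I am asking.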

Regards,

Vishal.
