We can certainly help you debug this further. Some questions:

1. Are you processing messages at all from the "suffering" containers? (You can verify this by observing metrics, logging, etc.)
2. If you are indeed processing messages, is it possible that the impacted containers are not able to keep up with the surge in data? You could try re-partitioning your input topics (and increasing the number of containers).

3. If you are not processing messages, can you provide us with your stack trace? It will be super helpful for finding out if (and where) the containers are stuck.

(I have put a few rough config/command sketches for each of these at the very end of this mail.)

Thanks,
Jagadish

On Wed, Mar 8, 2017 at 1:05 PM, Ankit Malhotra <amalho...@appnexus.com> wrote:
> Hi,
>
> While joining 2 streams from 2 partitions, we see that some containers
> start suffering, in that the lag (messages behind high watermark) for one
> of the tasks starts skyrocketing while the other stays at ~0.
>
> We are using default values for the buffer sizes and fetch threshold, are
> using 4 threads in the pool, and are using the default
> RoundRobinMessageChooser.
>
> Happy to share more details/config if it can help debug this further.
>
> Thanks,
> Ankit

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University
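
P.S. A few rough sketches for the suggestions above. These are untested, and the exact config keys and metric names can vary across Samza versions, so please double-check them against the docs for the version you are running.

For (1), one way to see whether the affected tasks are actually processing anything is to enable a metrics reporter and watch the per-task process counters next to the per-partition lag gauge, e.g. something like:

  metrics.reporters=snapshot,jmx
  metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
  metrics.reporter.snapshot.stream=kafka.metrics
  metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory

If counters like process-envelopes keep increasing for the affected tasks while messages-behind-high-watermark also keeps growing, the tasks are processing but cannot keep up; if the process counters are flat, the tasks are likely stuck.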
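
For (2), re-partitioning and scaling out could look roughly like this (topic names, ZooKeeper address, and counts are placeholders). One caveat for joins: both input topics need to keep the same partition count so that matching keys land in the same task, and adding partitions changes the key-to-partition mapping for newly produced messages.

  # add partitions to both input topics (placeholder names/addresses)
  kafka-topics.sh --zookeeper zk-host:2181 --alter --topic input-topic-a --partitions 8
  kafka-topics.sh --zookeeper zk-host:2181 --alter --topic input-topic-b --partitions 8

  # then raise the container count in the job config
  # (recent Samza uses job.container.count; older releases use yarn.container.count)
  job.container.count=8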
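
For (3), a thread dump from a suffering container is usually the quickest way to see where it is stuck. On the host running that container (the pid is a placeholder):

  # dump all JVM threads of the container process
  jstack <pid> > container-threads.txt
  # or: kill -3 <pid>   (the dump then goes to the container's stdout log)

The run loop and the task threads are the interesting ones to look at; please share that output and we can dig further.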