shanthoosh opened a new pull request #1119: SAMZA-2284: Remove redundant getStreamMetadata calls in SamzaContainer startup sequence.
URL: https://github.com/apache/samza/pull/1119

**Problem:** SAMZA-2122 added support for initializing the lag of input streams during task initialization. To accomplish that, the metadata of the input stream was fetched for every system stream partition assigned to a task instance. As a result, the SamzaContainer startup sequence fetches the metadata of the same input streams multiple times. Fetching the metadata of a stream entails making an expensive remote call to the underlying messaging broker. These redundant fetch-input-stream-metadata API invocations significantly delayed the start of actual message processing in the Samza job.

**Impact:**
1. With some Samza jobs at LinkedIn, we observed that this fetch-input-stream-metadata loop took around 1.5 hours to complete.
2. The redundant fetch-input-stream-metadata remote API calls significantly increase the load on the underlying messaging broker.

**Fix:** This patch removes the unnecessary delay by fetching the metadata of the input streams only once in SamzaContainer (rather than once per SystemStreamPartition per task).

**Validation:**
1. Fixed the existing unit tests.
2. Verified this change with a sample samza-yarn job at LinkedIn and with a hello-samza job from the samza-hello-world examples in OSS.
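
The difference between the two fetch patterns described above can be illustrated with a minimal, self-contained sketch. This is not the code in the patch and does not use Samza's API: `Ssp`, `CountingBroker`, `fetchPerPartition`, `fetchOncePerStream`, and the sample stream names are all hypothetical stand-ins, and the "remote call" is just a counter. It assumes two input streams with four partitions each, spread across four tasks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MetadataFetchSketch {

    /** Hypothetical stand-in for a system stream partition (not Samza's class). */
    static final class Ssp {
        final String stream;
        final int partition;

        Ssp(String stream, int partition) {
            this.stream = stream;
            this.partition = partition;
        }
    }

    /** Hypothetical broker client: counts how many "remote" metadata calls are made. */
    static final class CountingBroker {
        int remoteCalls = 0;

        String fetchStreamMetadata(String stream) {
            remoteCalls++; // each call stands for one expensive round trip
            return "metadata-for-" + stream;
        }
    }

    /** Redundant pattern: one metadata fetch per partition per task. */
    static void fetchPerPartition(Map<String, List<Ssp>> assignments, CountingBroker broker) {
        for (List<Ssp> ssps : assignments.values()) {
            for (Ssp ssp : ssps) {
                broker.fetchStreamMetadata(ssp.stream);
            }
        }
    }

    /** Deduplicated pattern: collect the distinct streams, fetch each once, share the result. */
    static Map<String, String> fetchOncePerStream(Map<String, List<Ssp>> assignments, CountingBroker broker) {
        Set<String> streams = new LinkedHashSet<>();
        for (List<Ssp> ssps : assignments.values()) {
            for (Ssp ssp : ssps) {
                streams.add(ssp.stream);
            }
        }
        Map<String, String> metadataByStream = new LinkedHashMap<>();
        for (String stream : streams) {
            metadataByStream.put(stream, broker.fetchStreamMetadata(stream));
        }
        return metadataByStream;
    }

    /** Sample layout (assumed): 4 tasks, each owning one partition of two streams. */
    static Map<String, List<Ssp>> sampleAssignments() {
        Map<String, List<Ssp>> assignments = new LinkedHashMap<>();
        for (int p = 0; p < 4; p++) {
            List<Ssp> ssps = new ArrayList<>();
            ssps.add(new Ssp("PageViews", p));
            ssps.add(new Ssp("Clicks", p));
            assignments.put("task-" + p, ssps);
        }
        return assignments;
    }

    public static void main(String[] args) {
        CountingBroker perPartition = new CountingBroker();
        fetchPerPartition(sampleAssignments(), perPartition);
        System.out.println("per-partition fetches: " + perPartition.remoteCalls); // 8

        CountingBroker once = new CountingBroker();
        fetchOncePerStream(sampleAssignments(), once);
        System.out.println("once-per-stream fetches: " + once.remoteCalls); // 2
    }
}
```

In this toy layout the per-partition loop makes 8 calls while the deduplicated version makes 2. The number of calls in the first pattern grows with the partition count, whereas in the second it depends only on the number of distinct input streams.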