shanthoosh opened a new pull request #1119: SAMZA-2284: Remove redundant 
getStreamMetadata calls in SamzaContainer startup sequence.
URL: https://github.com/apache/samza/pull/1119
 
 
   **Problem:**
   
   SAMZA-2122 added support for initializing the lag of input streams during task 
initialization. To accomplish that, the metadata of the input stream was 
fetched for every system stream partition assigned to a task instance.
   
   The SamzaContainer startup sequence fetches the metadata of the same input 
streams multiple times. Fetching the metadata of a stream entails making an 
expensive remote call to the underlying messaging broker. These redundant 
fetch-input-stream-metadata API invocations incur significant delays in the 
start of actual message processing in the Samza job.
   
   **Impact:**
   
   1. With some Samza jobs at LinkedIn, we observed that this 
fetch-input-stream-metadata loop took around 1.5 hours to complete.
   2. The redundant fetch-input-stream-metadata remote API calls 
significantly increase the load on the underlying messaging broker.
   
   **Fix:**
   
   This patch fixes the unnecessary delay by fetching the metadata of the input 
streams only once in SamzaContainer (rather than once per SystemStreamPartition 
per task).
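   
   The difference between the two fetch patterns can be sketched as follows. 
This is a minimal illustration, not the code in this patch: `Ssp`, 
`CountingBroker`, `fetchPerPartition`, and `fetchOncePerStream` are hypothetical 
names invented for the example, and the broker is a stub that only counts calls 
instead of making a remote request.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StreamMetadataFetchSketch {

    /** Hypothetical stand-in for a (stream, partition) pair; not Samza's own class. */
    static final class Ssp {
        final String stream;
        final int partition;

        Ssp(String stream, int partition) {
            this.stream = stream;
            this.partition = partition;
        }
    }

    /** Hypothetical broker stub; each call models one expensive remote fetch. */
    static final class CountingBroker {
        int calls = 0;

        String fetchStreamMetadata(String stream) {
            calls++;
            return "metadata:" + stream;
        }
    }

    /** Old pattern: one fetch per partition per task. Returns the broker's call count. */
    static int fetchPerPartition(Map<String, List<Ssp>> sspsByTask, CountingBroker broker) {
        for (List<Ssp> ssps : sspsByTask.values()) {
            for (Ssp ssp : ssps) {
                broker.fetchStreamMetadata(ssp.stream);
            }
        }
        return broker.calls;
    }

    /** New pattern: fetch each distinct stream once and share the result with all tasks. */
    static Map<String, String> fetchOncePerStream(Map<String, List<Ssp>> sspsByTask,
                                                  CountingBroker broker) {
        Map<String, String> metadataByStream = new HashMap<>();
        for (List<Ssp> ssps : sspsByTask.values()) {
            for (Ssp ssp : ssps) {
                // The broker is only called when the stream has not been seen yet.
                metadataByStream.computeIfAbsent(ssp.stream, broker::fetchStreamMetadata);
            }
        }
        return metadataByStream;
    }

    public static void main(String[] args) {
        Map<String, List<Ssp>> sspsByTask = Map.of(
            "task-0", List.of(new Ssp("page-views", 0), new Ssp("clicks", 0)),
            "task-1", List.of(new Ssp("page-views", 1), new Ssp("clicks", 1)));

        CountingBroker before = new CountingBroker();
        CountingBroker after = new CountingBroker();
        System.out.println("per-partition calls: " + fetchPerPartition(sspsByTask, before));
        fetchOncePerStream(sspsByTask, after);
        System.out.println("once-per-stream calls: " + after.calls);
    }
}
```

   In this toy setup the number of remote calls grows with the number of 
distinct streams rather than with the number of partitions and tasks.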
   
   **Validation:**
   
   1. Fixed the existing unit tests.
   2. Verified this change with a sample samza-yarn job at LinkedIn and with a 
hello-samza job from the samza-hello-world examples in OSS. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
