[ https://issues.apache.org/jira/browse/SAMZA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472871#comment-16472871 ]
ASF GitHub Bot commented on SAMZA-1670:
---------------------------------------

GitHub user cameronlee314 opened a pull request:

    https://github.com/apache/samza/pull/520

    SAMZA-1670 : When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cameronlee314/samza partition_metadata

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #520

----
commit 20c483192ba0a378f02ea38afa9729a81ee74b33
Author: Cameron Lee <calee@...>
Date:   2018-05-11T22:18:06Z

    SAMZA-1670 : When fetching a newest offset for a partition, also prefetch and cache the newest offsets for other partitions on the container

----

> When fetching a newest offset for a partition, also prefetch and cache the
> newest offsets for other partitions on the container
> --------------------------------------------------------------------------
>
>                 Key: SAMZA-1670
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1670
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Cameron Lee
>            Priority: Major
>              Labels: metadata
>
> ExtendedSystemAdmin.getNewestOffset currently works on only one
> system-stream-partition at a time. As an optimization, when one
> system-stream-partition needs a newest offset, a batch call can be leveraged
> to also fetch newest offsets (and cache the data) for other partitions on the
> same container.
> This can help to reduce the call volume to system admins to get newest offset
> metadata. This can also help reduce contention on system admins when metadata
> is needed by multiple threads at the same time.
> *Proposed approach:*
> Add a new getNewestOffset API to StreamMetadataCache. Have the cache keep
> track of all system-stream-partitions that have asked for newest offsets
> before, and when a system-stream-partition needs newest offset metadata,
> check if there are any other stale entries and fetch those as well. This also
> requires adding a getNewestOffsets batch call to ExtendedSystemAdmin. The
> benefit here is that StreamMetadataCache is already reused by multiple tasks,
> but the disadvantage is that it has to keep track of new state.
> *Alternative approach:*
> Collect all system-stream-partitions that will need newest offset metadata at
> setup, and then make the batch call whenever any of those partitions needs
> metadata and cache the metadata. The benefit for this approach is that no
> state needs to be built up, as it is known at setup, but it might be unclean
> to do the initial collection and keep track of it. For example, it might be
> necessary to store container-granular information inside partition-granular
> objects (e.g. TaskStorageManager).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
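
For illustration only, the proposed approach quoted above might look roughly
like the sketch below. It is a self-contained toy, not the code in the pull
request and not Samza's actual StreamMetadataCache. The class name
NewestOffsetCache, the use of plain strings for system-stream-partitions, the
TTL-based staleness rule, and the injected batchFetcher function (standing in
for a getNewestOffsets batch call) are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a cache that batches newest-offset fetches. */
class NewestOffsetCache {
  /** A cached offset plus the time at which it was fetched. */
  private static final class Entry {
    final String offset;
    final long fetchedAtMs;

    Entry(String offset, long fetchedAtMs) {
      this.offset = offset;
      this.fetchedAtMs = fetchedAtMs;
    }
  }

  // Assumption: stands in for a batch getNewestOffsets call on a system admin.
  private final Function<Set<String>, Map<String, String>> batchFetcher;
  // Assumption: an entry is "stale" once it is at least ttlMs old.
  private final long ttlMs;
  // Every partition that has ever asked for a newest offset.
  private final Set<String> tracked = new HashSet<>();
  private final Map<String, Entry> cache = new HashMap<>();

  NewestOffsetCache(Function<Set<String>, Map<String, String>> batchFetcher, long ttlMs) {
    this.batchFetcher = batchFetcher;
    this.ttlMs = ttlMs;
  }

  private boolean isStale(String ssp, long nowMs) {
    Entry e = cache.get(ssp);
    return e == null || nowMs - e.fetchedAtMs >= ttlMs;
  }

  /**
   * Returns the newest offset for one partition. On a miss or stale hit, all
   * other stale tracked partitions are refreshed in the same batch call.
   * The clock is passed in explicitly to keep the sketch deterministic.
   */
  synchronized String getNewestOffset(String ssp, long nowMs) {
    tracked.add(ssp);
    if (!isStale(ssp, nowMs)) {
      return cache.get(ssp).offset;
    }
    Set<String> stale = new HashSet<>();
    for (String s : tracked) {
      if (isStale(s, nowMs)) {
        stale.add(s);
      }
    }
    for (Map.Entry<String, String> f : batchFetcher.apply(stale).entrySet()) {
      cache.put(f.getKey(), new Entry(f.getValue(), nowMs));
    }
    Entry fresh = cache.get(ssp);
    return fresh == null ? null : fresh.offset;
  }
}
```

The synchronized method reflects the contention concern in the issue: when
several threads need metadata at once, only the first pays for the batch
call, and the others are served from the cache. The tracked set is the "new
state" the issue lists as the drawback of this approach.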