[ 
https://issues.apache.org/jira/browse/SAMZA-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472871#comment-16472871
 ] 

ASF GitHub Bot commented on SAMZA-1670:
---------------------------------------

GitHub user cameronlee314 opened a pull request:

    https://github.com/apache/samza/pull/520

    SAMZA-1670 : When fetching a newest offset for a partition, also prefetch 
    and cache the newest offsets for other partitions on the container

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cameronlee314/samza partition_metadata

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #520
    
----
commit 20c483192ba0a378f02ea38afa9729a81ee74b33
Author: Cameron Lee <calee@...>
Date:   2018-05-11T22:18:06Z

    SAMZA-1670 : When fetching a newest offset for a partition, also prefetch 
    and cache the newest offsets for other partitions on the container

----


> When fetching a newest offset for a partition, also prefetch and cache the 
> newest offsets for other partitions on the container
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SAMZA-1670
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1670
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Cameron Lee
>            Priority: Major
>              Labels: metadata
>
> ExtendedSystemAdmin.getNewestOffset currently works on only one 
> system-stream-partition at a time. As an optimization, when one 
> system-stream-partition needs a newest offset, a batch call can be leveraged 
> to also fetch newest offsets (and cache the data) for other partitions on the 
> same container.
> This can help to reduce the call volume to system admins to get newest offset 
> metadata. This can also help reduce contention on system admins when metadata 
> is needed by multiple threads at the same time.
> *Proposed approach:*
> Add a new getNewestOffset API to StreamMetadataCache. Have the cache keep 
> track of all system-stream-partitions that have asked for newest offsets 
> before, and when a system-stream-partition needs newest offset metadata, 
> check if there are any other stale entries and fetch those as well. This also 
> requires adding a getNewestOffsets batch call to ExtendedSystemAdmin. The 
> benefit here is that StreamMetadataCache is already reused by multiple tasks, 
> but the disadvantage is that it has to keep track of new state.
> *Alternative approach:*
> Collect all system-stream-partitions that will need newest offset metadata at 
> setup, and then make the batch call whenever any of those partitions needs 
> metadata and cache the metadata. The benefit for this approach is that no 
> state needs to be built up, as it is known at setup, but it might be unclean 
> to do the initial collection and keep track of it. For example, it might be 
> necessary to store container-granular information inside partition-granular 
> objects (e.g. TaskStorageManager).
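
For illustration only, the proposed approach could be sketched roughly as follows. This is not the code in the pull request and not Samza's actual StreamMetadataCache. The class name NewestOffsetCache, the batchFetcher function (a stand-in for the proposed getNewestOffsets batch call), the use of plain String keys in place of system-stream-partitions, and the time-based staleness rule are all assumptions made for the sketch; the issue does not define what makes an entry stale.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a newest-offset cache that prefetches in batches. */
class NewestOffsetCache {
  /** A cached offset plus the time it was fetched. */
  private static final class CachedOffset {
    final String offset;
    final long fetchedAtMs;

    CachedOffset(String offset, long fetchedAtMs) {
      this.offset = offset;
      this.fetchedAtMs = fetchedAtMs;
    }
  }

  // Assumed stand-in for a batch call such as the proposed getNewestOffsets.
  private final Function<Set<String>, Map<String, String>> batchFetcher;
  // Assumed staleness rule: an entry is stale once it is ttlMs old.
  private final long ttlMs;
  private final Map<String, CachedOffset> entries = new HashMap<>();
  // Every partition that has ever asked for a newest offset.
  private final Set<String> tracked = new HashSet<>();

  NewestOffsetCache(Function<Set<String>, Map<String, String>> batchFetcher, long ttlMs) {
    this.batchFetcher = batchFetcher;
    this.ttlMs = ttlMs;
  }

  private boolean isStale(CachedOffset cached, long nowMs) {
    return nowMs - cached.fetchedAtMs >= ttlMs;
  }

  /** Returns the newest offset for ssp, refreshing all stale tracked entries in one batch. */
  synchronized String getNewestOffset(String ssp, long nowMs) {
    tracked.add(ssp);
    CachedOffset cached = entries.get(ssp);
    if (cached != null && !isStale(cached, nowMs)) {
      return cached.offset; // fresh hit: no call to the system admin
    }
    // Miss or stale: gather every tracked partition that is missing or stale.
    Set<String> toFetch = new HashSet<>();
    for (String tracedSsp : tracked) {
      CachedOffset entry = entries.get(tracedSsp);
      if (entry == null || isStale(entry, nowMs)) {
        toFetch.add(tracedSsp);
      }
    }
    // One batch call covers the requested partition and the other stale ones.
    Map<String, String> fetched = batchFetcher.apply(toFetch);
    for (Map.Entry<String, String> result : fetched.entrySet()) {
      entries.put(result.getKey(), new CachedOffset(result.getValue(), nowMs));
    }
    return fetched.get(ssp); // null if the batch call returned nothing for ssp
  }
}
```

In this sketch the synchronized method means that concurrent callers on the same container wait for a single batch call rather than each issuing their own, which is one possible way to get the reduced contention described above. The tracked set is the "new state" that the description lists as the disadvantage of this approach.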



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
