[
https://issues.apache.org/jira/browse/TEZ-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor updated TEZ-4069:
------------------------------
Fix Version/s: 0.10.3
> Avoid repeated computation of preferred locations in split grouping.
> --------------------------------------------------------------------
>
> Key: TEZ-4069
> URL: https://issues.apache.org/jira/browse/TEZ-4069
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.9.2
> Reporter: Oliver Draese
> Priority: Major
> Fix For: 0.10.3
>
> Attachments: TEZ-4069.1.patch, TEZ-4069.patch
>
>
> TezSplitGrouper iterates through the list of splits multiple times when
> grouping them (see getGroupedSplits). On each pass, it asks the
> locationProvider for the array of preferred locations of each split.
> This has two side effects:
> * generating the list of preferred locations costs CPU time that can be
> avoided (e.g. calculating the consistent hash in
> HostAffinitySplitLocationProvider)
> * if the list of preferred locations changes between the loops of
> getGroupedSplits, we might encounter a NullPointerException. This happens
> when a new location appears that was not part of the initial set of
> locations used to populate the distinctLocations map.
> getGroupedSplits should query the location provider for the preferred
> locations only once per split and then cache them instead of asking
> the location provider repeatedly.
>
>
>
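The caching proposed in the description could look roughly like the Java
sketch below. It is not taken from the attached patches: Split,
cacheLocations and countByLocation are hypothetical names, and the
Function<Split, String[]> only stands in for the location provider. The idea
is that every later pass reads from the cached map, so it sees exactly the
locations that were used to build the per-location map.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class CachedLocationsSketch {

    /** Hypothetical stand-in for a split; not the actual Tez split type. */
    static final class Split {
        final String name;
        Split(String name) { this.name = name; }
    }

    /**
     * Asks the provider once per split and remembers the answer.
     * An IdentityHashMap is used so that splits are keyed by object identity
     * and do not need equals/hashCode implementations.
     */
    static Map<Split, String[]> cacheLocations(List<Split> splits,
                                               Function<Split, String[]> provider) {
        Map<Split, String[]> cached = new IdentityHashMap<>();
        for (Split split : splits) {
            if (!cached.containsKey(split)) {
                cached.put(split, provider.apply(split));
            }
        }
        return cached;
    }

    /**
     * Example of a later pass: counts splits per location using only the
     * cached arrays, never the provider itself.
     */
    static Map<String, Integer> countByLocation(List<Split> splits,
                                                Map<Split, String[]> cached) {
        Map<String, Integer> counts = new HashMap<>();
        for (Split split : splits) {
            for (String location : cached.get(split)) {
                counts.merge(location, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Because every loop works on the same cached arrays, a location that the
provider would only report on a later call cannot show up halfway through
grouping, and the cost of computing the locations is paid once per split.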
--
This message was sent by Atlassian Jira
(v8.20.10#820010)