[ https://issues.apache.org/jira/browse/TEZ-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621362#comment-14621362 ]
Saikat edited comment on TEZ-2172 at 7/28/15 2:21 PM:
------------------------------------------------------

Using an approach similar to TEZ-2613: a linked hash map <InputAttemptIdentifier.toString(), InputAttemptIdentifier>.


was (Author: saikatr):
LinkedHashSet seems to be a good option, as it retains the order in which items are inserted into the set and provides constant-time performance for add, contains and remove.
But pipelined shuffle can have multiple spill ids (the spill id is not used in equals()). So we could pass an indication to the fetchers that the input attempts are all for a pipelined-shuffle fetch, in which case a custom comparator wrapper would also include the spill id in the comparison; otherwise, ignore the spill id and use the default equals() of InputAttemptIdentifier.

This approach may not work if, in the future, a task can switch from pipelined shuffle to the final-merge type or vice versa (or decide to send out CDMEs for spills if the data is too skewed). In the current implementation, whether pipelined shuffle is enabled for a task is a static configuration.

> FetcherOrderedGrouped using List to store InputAttemptIdentifier can lead to some inefficiency during remove() operation
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2172
>                 URL: https://issues.apache.org/jira/browse/TEZ-2172
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Saikat
>
> As part of fixing TEZ-2001, FetcherOrderedGrouped stores InputAttemptIdentifier in a List. This can lead to some inefficiency, since the size of this list can be ~30 and remove() calls can be expensive.
> Option 1: use the spillId in the hashCode, or a wrapping structure for just this. However, spillId cannot be added to the hashCode, as it would break ShuffleScheduler shuffleInfoEventsMap.
> Option 2: consider using a Map with an identifier.
> Need to consider other options as well.
> Creating this jira as a placeholder to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
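
For illustration, the map-with-an-identifier idea discussed above (Option 2, and the linked hash map keyed by toString() from the edited comment) could look roughly like the sketch below. This is not the Tez code: Attempt is a hypothetical stand-in for InputAttemptIdentifier, PendingInputs and its methods are made-up names, and the sketch assumes the string form includes the spill id while equals()/hashCode() do not.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Tez's InputAttemptIdentifier; not the real class.
final class Attempt {
    final int inputIndex;
    final int attemptNumber;
    final int spillId;

    Attempt(int inputIndex, int attemptNumber, int spillId) {
        this.inputIndex = inputIndex;
        this.attemptNumber = attemptNumber;
        this.spillId = spillId;
    }

    // As described in the issue, equality ignores the spill id.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Attempt)) {
            return false;
        }
        Attempt other = (Attempt) o;
        return inputIndex == other.inputIndex && attemptNumber == other.attemptNumber;
    }

    @Override
    public int hashCode() {
        return 31 * inputIndex + attemptNumber;
    }

    // Assumption: the string form includes the spill id, so it can act as a map key
    // that tells spills of the same attempt apart.
    @Override
    public String toString() {
        return "Attempt[input=" + inputIndex + ", attempt=" + attemptNumber
            + ", spill=" + spillId + "]";
    }
}

// Insertion-ordered collection of pending attempts with constant-time removal.
final class PendingInputs {
    private final Map<String, Attempt> pending = new LinkedHashMap<>();

    void add(Attempt a) {
        pending.put(a.toString(), a);
    }

    // Returns true if the attempt was present and has been removed.
    boolean remove(Attempt a) {
        return pending.remove(a.toString()) != null;
    }

    int size() {
        return pending.size();
    }

    // Remaining attempts, in the order they were added.
    List<Attempt> remaining() {
        return new ArrayList<>(pending.values());
    }
}

final class PendingInputsDemo {
    public static void main(String[] args) {
        PendingInputs p = new PendingInputs();
        p.add(new Attempt(0, 0, 0));
        p.add(new Attempt(0, 0, 1)); // same input and attempt, different spill
        p.add(new Attempt(1, 0, 0));
        p.remove(new Attempt(0, 0, 0)); // hash lookup instead of a list scan
        System.out.println(p.remaining());
    }
}
```

Keying on the string form sidesteps the need to change hashCode(), which the description says would break ShuffleScheduler's shuffleInfoEventsMap. The trade-off is one string construction per add or remove.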