[ 
https://issues.apache.org/jira/browse/TEZ-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621362#comment-14621362
 ] 

Saikat edited comment on TEZ-2172 at 7/28/15 2:21 PM:
------------------------------------------------------

Using an approach similar to TEZ-2613: a LinkedHashMap of 
<InputAttemptIdentifier.toString(), InputAttemptIdentifier>.
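
A minimal sketch of the map-based idea, not the actual patch. AttemptId is a 
hypothetical stand-in for InputAttemptIdentifier, and PendingInputs and its 
method names are illustrative only. The sketch assumes that toString() 
includes the spill id, so that different spills of the same attempt get 
distinct keys.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for InputAttemptIdentifier; not the Tez class.
final class AttemptId {
    final int inputIndex;
    final int attemptNumber;
    final int spillId;

    AttemptId(int inputIndex, int attemptNumber, int spillId) {
        this.inputIndex = inputIndex;
        this.attemptNumber = attemptNumber;
        this.spillId = spillId;
    }

    // Assumption: the string form includes the spill id, so each spill
    // of the same attempt maps to its own key.
    @Override
    public String toString() {
        return "AttemptId[input=" + inputIndex
            + ", attempt=" + attemptNumber
            + ", spill=" + spillId + "]";
    }
}

// Tracks inputs that still have to be fetched, keyed by their string form.
final class PendingInputs {
    private final Map<String, AttemptId> pending = new LinkedHashMap<>();

    void add(AttemptId id) {
        pending.put(id.toString(), id);
    }

    // Hash lookup instead of the linear scan done by List.remove(Object).
    AttemptId remove(AttemptId id) {
        return pending.remove(id.toString());
    }

    boolean contains(AttemptId id) {
        return pending.containsKey(id.toString());
    }

    int size() {
        return pending.size();
    }

    // Remaining inputs, in the order in which they were added.
    List<AttemptId> remaining() {
        return new ArrayList<>(pending.values());
    }
}
```

Keying on the string form leaves equals() and hashCode() of the identifier 
untouched, at the cost of building a string for every add and remove.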


was (Author: saikatr):
LinkedHashSet seems to be a good option, as it retains the order in which 
items are inserted into the set and provides constant-time performance for 
add, contains and remove.

However, with pipelined shuffle an input attempt can have multiple spill ids, 
and the spill id is not used in equals(). 
So we could pass an indication to the fetchers that the input attempts are all 
for a pipelined-shuffle fetch, in which case a custom comparator wrapper would 
also include the spill id in the comparison; otherwise it would ignore the 
spill id and use the default equals() of InputAttemptIdentifier.
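
The wrapper idea could be sketched roughly as follows. This is an 
illustration under assumptions, not the Tez implementation: InputAttempt is a 
hypothetical stand-in whose equals() ignores the spill id, and FetchKey, 
FetchKeys and the pipelined flag are made-up names for the wrapper and the 
indication passed to the fetcher.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for InputAttemptIdentifier: equals() and hashCode()
// deliberately ignore the spill id.
final class InputAttempt {
    final int inputIndex;
    final int attemptNumber;
    final int spillId;

    InputAttempt(int inputIndex, int attemptNumber, int spillId) {
        this.inputIndex = inputIndex;
        this.attemptNumber = attemptNumber;
        this.spillId = spillId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof InputAttempt)) {
            return false;
        }
        InputAttempt other = (InputAttempt) o;
        return inputIndex == other.inputIndex
            && attemptNumber == other.attemptNumber;
    }

    @Override
    public int hashCode() {
        return Objects.hash(inputIndex, attemptNumber);
    }
}

// Wrapper whose equality includes the spill id only for pipelined fetches.
final class FetchKey {
    final InputAttempt attempt;
    final boolean pipelined;

    FetchKey(InputAttempt attempt, boolean pipelined) {
        this.attempt = attempt;
        this.pipelined = pipelined;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FetchKey)) {
            return false;
        }
        FetchKey other = (FetchKey) o;
        if (pipelined != other.pipelined || !attempt.equals(other.attempt)) {
            return false;
        }
        // Non-pipelined: default equality is enough. Pipelined: spills differ.
        return !pipelined || attempt.spillId == other.attempt.spillId;
    }

    @Override
    public int hashCode() {
        return pipelined
            ? Objects.hash(attempt, attempt.spillId)
            : attempt.hashCode();
    }
}

final class FetchKeys {
    // Builds an insertion-ordered set of wrapped attempts.
    static Set<FetchKey> of(boolean pipelined, InputAttempt... attempts) {
        Set<FetchKey> keys = new LinkedHashSet<>();
        for (InputAttempt a : attempts) {
            keys.add(new FetchKey(a, pipelined));
        }
        return keys;
    }
}
```

With the flag set, two spills of the same attempt stay separate entries in 
the set; with it unset they collapse into one, matching the default equals().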


This approach may not work if, in the future, a task can switch from pipelined 
shuffle to the final merge type or vice versa (or decide to send out spill 
CDMEs if the data is too skewed). 
In the current implementation, whether pipelined shuffle is enabled for a 
task is a static configuration.

> FetcherOrderedGrouped using List to store InputAttemptIdentifier can lead to 
> some inefficiency during remove() operation
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2172
>                 URL: https://issues.apache.org/jira/browse/TEZ-2172
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Saikat
>
> As part of fixing TEZ-2001, FetcherOrderedGrouped stores 
> InputAttemptIdentifier in a List. This can lead to some inefficiency, since 
> the size of this list can be ~30 and remove() calls can be expensive. 
> Option 1: use the spillId in the hashCode, or a wrapping structure for 
> just this. However, spillId cannot be added to the hashCode as it would 
> break ShuffleScheduler shuffleInfoEventsMap. 
> Option 2: consider using a Map with an identifier. 
> Other options need to be considered as well. Creating this jira as a 
> placeholder to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
