[ 
https://issues.apache.org/jira/browse/TEZ-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2719:
-----------------------------
    Attachment: TEZ-2719.3.patch

> Consider reducing logs in unordered fetcher with shared-fetch option
> --------------------------------------------------------------------
>
>                 Key: TEZ-2719
>                 URL: https://issues.apache.org/jira/browse/TEZ-2719
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2719.1.patch, TEZ-2719.2.patch, TEZ-2719.3.patch, 
> TEZ-2719.branch-0.7.patch
>
>
> For large broadcasts, shared fetch can produce excessive logging. For
> example, in one of the jobs (query_17 @ 10 TB scale), Map 7 generates around
> 1.1 GB of data, which is given to 330 tasks in the downstream Map 1.
> Map 1 uses all slots in the cluster (~224 per wave). Until the data is
> downloaded, shared fetch keeps re-queuing fetches. As part of that, it ends
> up printing 3 logs per attempt. E.g.:
> {noformat}
> 2015-08-14 02:09:11,761 INFO [Fetcher [Map_7] #0] shuffle.Fetcher: Requeuing 
> machine1:13562 downloads because we didn't get a lock
> 2015-08-14 02:09:11,761 INFO [Fetcher [Map_7] #0] shuffle.Fetcher: Shared 
> fetch failed to return 1 inputs on this try
> 2015-08-14 02:09:11,761 INFO [ShuffleRunner [Map_7]] impl.ShuffleManager: 
> Scheduling fetch for inputHost: machine1:13562
> 2015-08-14 02:09:11,761 INFO [ShuffleRunner [Map_7]] impl.ShuffleManager: 
> Created Fetcher for host: machine1 with inputs: [InputAttemptIdentifier 
> [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
> pathComponent=attempt_1439264591968_0058_1_04_000000_0_10029, 
> fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1]]
> {noformat}
> Depending on disk and network speed, it might take a while for the fetcher
> to finish downloading and release the lock. Since there was only one task in
> Map-1, it ended up in a sort of tight loop, generating relatively large logs.
> Task log files of 260-290 MB are created per attempt in this case. That
> would be around 2.3 GB to 3 GB (depending on the number of slots waiting) on
> a machine with 8-10 slots.
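
One common way to tame a tight re-queue loop like the one described above is
to rate-limit the repeated message and report how many copies were dropped.
The following is only a minimal sketch of that idea, not the change made in
the attached patches: the class ThrottledLog, its offer() method and the
one-second interval are all hypothetical names and values chosen for
illustration, and none of them are part of Tez.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Tez): lets at most one message through
 * per interval and counts the ones it suppresses in between.
 */
final class ThrottledLog {
    private static final long NEVER = Long.MIN_VALUE; // sentinel: nothing emitted yet

    private final long intervalNanos;
    private final AtomicLong lastEmit = new AtomicLong(NEVER);
    private final AtomicLong suppressed = new AtomicLong();

    ThrottledLog(long intervalMillis) {
        this.intervalNanos = intervalMillis * 1_000_000L;
    }

    /**
     * Returns the text to log, or null if the message should be dropped.
     * The caller passes the current time (e.g. System.nanoTime()) so the
     * behaviour is easy to test.
     */
    String offer(String msg, long nowNanos) {
        long last = lastEmit.get();
        boolean due = last == NEVER || nowNanos - last >= intervalNanos;
        // compareAndSet ensures only one thread wins the right to emit.
        if (due && lastEmit.compareAndSet(last, nowNanos)) {
            long skipped = suppressed.getAndSet(0);
            return skipped == 0
                ? msg
                : msg + " (" + skipped + " similar messages suppressed)";
        }
        suppressed.incrementAndGet();
        return null;
    }

    public static void main(String[] args) {
        ThrottledLog log = new ThrottledLog(1000); // assumed 1 s interval
        long now = 0L;
        // Simulate five re-queue attempts, 300 ms apart.
        for (int i = 0; i < 5; i++) {
            String out = log.offer("Requeuing downloads because we didn't get a lock", now);
            if (out != null) {
                System.out.println(out);
            }
            now += 300_000_000L;
        }
    }
}
```

In this simulation only the first and the fifth attempt produce output, and
the fifth line carries the count of the three attempts dropped in between.
Lowering the level of the per-attempt messages from INFO to DEBUG would be a
simpler alternative, at the cost of losing the suppressed-count summary.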



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
