Thanks,

it’s 0.6.0 and it looks like it is in that state forever. Job got killed after 
7 hours.

Johannes

> On 11 Aug 2015, at 11:21, Rajesh Balamohan <[email protected]> wrote:
> 
> Is it 0.6.0 or 0.6.1? Does that fetcher get locked up in Status.WAIT forever?
> 
> This could be related to https://issues.apache.org/jira/browse/TEZ-1923 
> <https://issues.apache.org/jira/browse/TEZ-1923> (fix available in 0.6.1).
> 
> ~Rajesh.B
> 
> On Tue, Aug 11, 2015 at 1:48 PM, Johannes Zillmann <[email protected] 
> <mailto:[email protected]>> wrote:
> Hey Tez Community,
> 
> a customer of ours has following situation. He sets up a computation pipeline 
> in a way that 2 independent graphs are running in an Tez DAG.
> Result: the job is still running after a half day with only alive task 
> running, seems to running forever. 
> 
> If he splits the pipeline that each of the independent graph are executed in 
> their own DAG, then each of the DAG’s are completed in one or two hours.
> 
> Here is some initial logging which might be of interest of that one 
> never-ending task:
> 2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle: Shuffle 
> assigned with 55 inputs, codec: NoneifileReadAhead: true
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.ShuffleScheduler: 
> ShuffleScheduler running for sourceVertex: MapChainVertex:Workbook job (184): 
> ClickPairs#ClikDataWithPlayerSession(Disconnected record stream) 
> (4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration: 
> maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, 
> maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: 
> InitialRequest: ShuffleMem=1932735232 <tel:1932735232>, postMergeMem=0, 
> RuntimeTotalAvailable=1932735232 <tel:1932735232>. Updated to: 
> ShuffleMem=1932735232 <tel:1932735232>, postMergeMem: 0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: 
> MergerManager: memoryLimit=1932735232 <tel:1932735232>, 
> maxSingleShuffleLimit=483183808, mergeThreshold=1739461632 <tel:1739461632>, 
> ioSortFactor=100, memToMemMergeOutputsThreshold=100
> 
> 
> And then there are hours full of those things:
> 2015-08-03 13:33:59,911 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,911 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms
> 2015-08-03 13:33:59,913 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
> url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
>  
> <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006>
>  sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,913 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,913 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,914 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
> url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
>  
> <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015>
>  sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,914 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,914 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,916 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
> url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
>  
> <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006>
>  sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,916 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,916 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,917 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
> url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012
>  
> <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012>
>  sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,917 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,917 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,919 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
> url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
>  
> <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006>
>  sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,919 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,919 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,920 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
> url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
>  
> <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015>
>  sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,920 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,920 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #36] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,922 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for 
> url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
>  
> <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006>
>  sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,922 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #25] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned 
> Status.WAIT ...
> 2015-08-03 13:33:59,922 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #25] 
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
> fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms
> 2015-08-03 13:34:06,125 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for 
> url=http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009
>  
> <http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009>
>  sent hash and receievd reply 0 ms
> 
> Focusing on an individual fetcher it looks like this over time:
> 2015-08-03 13:46:45,977 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #31] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
> Status.WAIT …
> …
> 2015-08-03 14:39:44,974 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #31] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
> Status.WAIT …
> …
> 2015-08-03 15:27:32,739 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #31] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
> Status.WAIT …
> …
> 2015-08-03 17:42:28,354 INFO [fetcher 
> [MapChainVertex_Workbook_job__184___Click] #31] 
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
> Status.WAIT …
> 
> 
> That behaviour is consistent for the customer.
> Using Tez-0.6…
> Any ideas what could cause this ?
> 
> Johannes
> 

Reply via email to