Hey Tez Community, a customer of ours has following situation. He sets up a computation pipeline in a way that 2 independent graphs are running in an Tez DAG. Result: the job is still running after a half day with only alive task running, seems to running forever.
If he splits the pipeline that each of the independent graph are executed in their own DAG, then each of the DAG’s are completed in one or two hours. Here is some initial logging which might be of interest of that one never-ending task: 2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle: Shuffle assigned with 55 inputs, codec: NoneifileReadAhead: true 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: MapChainVertex:Workbook job (184): ClickPairs#ClikDataWithPlayerSession(Disconnected record stream) (4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration: maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: InitialRequest: ShuffleMem=1932735232, postMergeMem=0, RuntimeTotalAvailable=1932735232. Updated to: ShuffleMem=1932735232, postMergeMem: 0 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: MergerManager: memoryLimit=1932735232, maxSingleShuffleLimit=483183808, mergeThreshold=1739461632, ioSortFactor=100, memToMemMergeOutputsThreshold=100 And then there are hours full of those things: 2015-08-03 13:33:59,911 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,911 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms 2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 sent hash and receievd reply 0 ms 2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms 2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 sent hash and receievd reply 0 ms 2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms 2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 sent hash and receievd reply 0 ms 2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms 2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012 sent hash and receievd reply 0 ms 2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms 2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 sent hash and receievd reply 0 ms 2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms 2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 sent hash and receievd reply 0 ms 2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms 2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 sent hash and receievd reply 0 ms 2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned Status.WAIT ... 2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms 2015-08-03 13:34:06,125 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for url=http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009 sent hash and receievd reply 0 ms Focusing on an individual fetcher it looks like this over time: 2015-08-03 13:46:45,977 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT … … 2015-08-03 14:39:44,974 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT … … 2015-08-03 15:27:32,739 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT … … 2015-08-03 17:42:28,354 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT … That behaviour is consistent for the customer. Using Tez-0.6… Any ideas what could cause this ? Johannes
