From the vertex re-running messages, it seems like you may be seeing errors in fetching shuffle data from mappers to reducers. This results in re-running map vertex tasks to produce only the missing data, along with the corresponding vertex re-running message.

Bikas
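[Editorial note: for anyone wanting to see where these messages surface, below is a minimal sketch of reading DAG- and vertex-level diagnostics through the Tez client API. It assumes a DAGClient obtained from TezClient.submitDAG(...); the class name DagDiagnostics, the method printDiagnostics, and the printing logic are illustrative only, and the API details should be checked against the Tez release in use.]

import java.io.IOException;
import java.util.EnumSet;

import org.apache.tez.dag.api.TezException;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.dag.api.client.StatusGetOpts;
import org.apache.tez.dag.api.client.VertexStatus;

public class DagDiagnostics {

  // Prints DAG- and vertex-level diagnostics; shuffle fetch failures and the
  // resulting "Vertex re-running" notes are reported through these channels.
  // The DAGClient is assumed to come from TezClient.submitDAG(dag).
  static void printDiagnostics(DAGClient dagClient, String vertexName)
      throws IOException, TezException {
    DAGStatus dagStatus = dagClient.getDAGStatus(EnumSet.noneOf(StatusGetOpts.class));
    for (String line : dagStatus.getDiagnostics()) {
      System.out.println("DAG: " + line);
    }

    VertexStatus vertexStatus =
        dagClient.getVertexStatus(vertexName, EnumSet.noneOf(StatusGetOpts.class));
    for (String line : vertexStatus.getDiagnostics()) {
      System.out.println(vertexName + ": " + line);
    }
  }
}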
From: Rajesh Balamohan [mailto:[email protected]]
Sent: Saturday, June 11, 2016 5:14 AM
To: [email protected]
Subject: Re: Question on Tez 0.6 and Tez 0.7

Can you share the exceptions that happened in the Tez 0.7 runs?

-Rajesh.B

On 11-Jun-2016 16:28, "Sungwoo Park" <[email protected]> wrote:

Hello,

I have a question about the performance difference between Tez 0.6.2 and Tez 0.7.0. This is what we did:

1. Installed HDP 2.4 on a 10-node cluster with default settings. No other particular changes were made to the default settings recommended by HDP 2.4.

2. Ran TeraSort using Tez 0.6.2 and Tez 0.7.0 and compared the running times. Each experiment specifies the amount of input data per node; for example, 10GB_per_node means a total of 100GB of input because there are 10 data nodes in the cluster.

We've found that Tez 0.7.0 runs consistently slower than Tez 0.6.2, producing 'Vertex re-running' errors quite often when the size of input data per node is over 40GB. Even when there is no 'Vertex re-running', Tez 0.7.0 took much longer than Tez 0.6.2.

We know that Tez 0.7.0 runs faster than Tez 0.6.2, because on a cluster of 44 nodes (with only 24GB memory per node), Tez 0.7.0 finished TeraSort almost as fast as Tez 0.6.2. We are trying to figure out what we missed in the experiments on the 11-node cluster. Any help here would be appreciated. Thanks a lot.

Sungwoo Park

-----
Configuration

HDP 2.4
11 nodes (10 data nodes), each with 96GB memory and 6 x 500GB HDDs
Same HDFS, YARN, and MR settings for both versions.
Each mapper container uses 5GB; each reducer container uses 10GB.

Configurations specific to tez-0.6.2:
  tez.runtime.sort.threads = 2

Configurations specific to tez-0.7.0:
  tez.grouping.max-size = 1073741824
  tez.runtime.sorter.class = PIPELINED
  tez.runtime.pipelined.sorter.sort.threads = 2

(A sketch of applying these settings programmatically follows the results below.)

-----
TEZ-0.6.2

10GB_per_node
id  time  num_containers  mem         core    diag
0   212   239             144695261   21873
1   204   239             139582665   20945
2   211   239             143477178   21700

20GB_per_node
id  time  num_containers  mem         core    diag
0   392   239             272528515   42367
1   402   239             273085026   42469
2   410   239             270118502   42111

40GB_per_node
id  time  num_containers  mem         core    diag
0   761   239             525320249   82608
1   767   239             527612323   83271
2   736   239             520229980   82317

80GB_per_node
id  time  num_containers  mem         core    diag
0   1564  239             1123903845  173915
1   1666  239             1161079968  178656
2   1628  239             1146656912  175998

160GB_per_node
id  time  num_containers  mem         core    diag
0   3689  239             2523160230  377563
1   3796  240             2610411363  388928
2   3624  239             2546652697  381400

-----
TEZ-0.7.0

10GB_per_node
id  time  num_containers  mem         core    diag
0   262   239             179373935   26223
1   259   239             179375665   25767
2   271   239             186946086   26516

20GB_per_node
id  time  num_containers  mem         core    diag
0   572   239             380034060   55515
1   533   239             364082337   53555
2   515   239             356570788   52762

40GB_per_node
id  time  num_containers  mem         core    diag
0   1405  339             953706595   136624  Vertex re-running
1   1157  239             828765079   118293
2   1219  239             833052604   118151

80GB_per_node
id  time  num_containers  mem         core    diag
0   3046  361             1999047193  279635  Vertex re-running
1   2967  337             2079807505  290171  Vertex re-running
2   3138  355             2030176406  282875  Vertex re-running

160GB_per_node
id  time  num_containers  mem         core    diag
0   6832  436             4524472859  634518  Vertex re-running
1   6233  365             4123693672  573259  Vertex re-running
2   6133  379             4121812899  579044  Vertex re-running
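[Editorial note: as referenced in the configuration section above, here is a minimal sketch of how the version-specific settings could be applied programmatically. The keys and values are copied verbatim from the report; the surrounding client code (the class name TeraSortTezSettings, the TezConfiguration wiring, and the mention of TezClient) is illustrative only and is not part of the original experiment description.]

import org.apache.hadoop.conf.Configuration;
import org.apache.tez.dag.api.TezConfiguration;

public class TeraSortTezSettings {

  // Builds a TezConfiguration carrying the settings reported for the
  // tez-0.7.0 runs; the tez-0.6.2 variant is shown in the comment below.
  static TezConfiguration buildTez070Conf() {
    TezConfiguration conf = new TezConfiguration(new Configuration());

    // Settings specific to the tez-0.7.0 runs (values taken from the report).
    conf.set("tez.grouping.max-size", "1073741824");            // 1 GiB grouped splits
    conf.set("tez.runtime.sorter.class", "PIPELINED");          // pipelined sorter
    conf.set("tez.runtime.pipelined.sorter.sort.threads", "2"); // 2 sort threads

    // For the tez-0.6.2 runs only this key was changed instead:
    // conf.set("tez.runtime.sort.threads", "2");

    // The configuration would then be passed to TezClient.create(...)
    // when submitting the TeraSort DAG.
    return conf;
  }
}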
