Hello,
I have a question about the performance difference between Tez 0.6.2 and
Tez 0.7.0.
This is what we did:
1. Installed HDP 2.4 on an 11-node cluster (10 data nodes), keeping the
default settings recommended by HDP 2.4.
2. Ran TeraSort using Tez 0.6.2 and Tez 0.7.0, and compared the running
time.
Each experiment is labeled by the amount of input data per node. For
example, 10GB_per_node means 100GB of input in total, because there are
10 data nodes in the cluster.
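For reference, this is roughly how each run was driven. It is a minimal
Java sketch, not our exact driver: the class name and HDFS paths are
placeholders, the sizing arithmetic assumes decimal GB, and it relies on
HDP's MR-over-Tez shim selected via mapreduce.framework.name=yarn-tez.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.examples.terasort.TeraGen;
    import org.apache.hadoop.examples.terasort.TeraSort;
    import org.apache.hadoop.util.ToolRunner;

    public class TeraSortOnTez {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the MR job through Tez instead of classic YARN MR
        // (HDP ships an MR-over-Tez shim keyed off this property).
        conf.set("mapreduce.framework.name", "yarn-tez");

        // 10GB_per_node x 10 data nodes = 100GB of input.
        // TeraGen rows are 100 bytes, so 1GB (10^9 bytes) = 10^7 rows.
        long gbPerNode = 10L;
        long numRows = gbPerNode * 10L * 10_000_000L;

        // Hypothetical HDFS paths; one input/output pair per run.
        ToolRunner.run(conf, new TeraGen(),
            new String[] { Long.toString(numRows), "/terasort/in" });
        ToolRunner.run(conf, new TeraSort(),
            new String[] { "/terasort/in", "/terasort/out" });
      }
    }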
We've found that Tez 0.7.0 runs consistently slower than Tez 0.6.2,
often producing 'Vertex re-running' errors when the input per node is
40GB or more. Even when no 'Vertex re-running' occurred, Tez 0.7.0 took
much longer than Tez 0.6.2.
We know that Tez 0.7.0 is not inherently slower than Tez 0.6.2, because
on a cluster of 44 nodes (with only 24GB of memory per node), Tez 0.7.0
finished TeraSort almost as fast as Tez 0.6.2. We are trying to figure
out what we missed in the experiments on the 11-node cluster.
Any help here would be appreciated. Thanks a lot.
Sungwoo Park
----- Configuration
HDP 2.4
11 nodes, 10 data nodes, each with 96GB memory, 6 x 500GB HDDs
same HDFS, YARN, and MapReduce settings for both Tez versions
Each mapper container uses 5GB.
Each reducer container uses 10GB.
Configurations specific to tez-0.6.2
tez.runtime.sort.threads = 2
Configurations specific to tez-0.7.0
tez.grouping.max-size = 1073741824
tez.runtime.sorter.class = PIPELINED
tez.runtime.pipelined.sorter.sort.threads = 2
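In case it helps to see everything in one place, here is a minimal
sketch of the settings above as client-side Configuration calls. The
ExperimentConf class is a placeholder, and mapping the 5GB/10GB
container sizes to mapreduce.map.memory.mb / mapreduce.reduce.memory.mb
is an assumption about how they were set; the tez.* keys are verbatim
from the list above.

    import org.apache.hadoop.conf.Configuration;

    public class ExperimentConf {
      // Container sizes: 5GB mappers, 10GB reducers. Mapping these to
      // the mapreduce.*.memory.mb keys is an assumption.
      static void common(Configuration conf) {
        conf.set("mapreduce.map.memory.mb", "5120");
        conf.set("mapreduce.reduce.memory.mb", "10240");
      }

      // Settings used only for the Tez 0.6.2 runs.
      static void tez062(Configuration conf) {
        conf.set("tez.runtime.sort.threads", "2");
      }

      // Settings used only for the Tez 0.7.0 runs: pipelined sorter
      // with two sort threads and a 1GB grouping cap.
      static void tez070(Configuration conf) {
        conf.set("tez.grouping.max-size", "1073741824");
        conf.set("tez.runtime.sorter.class", "PIPELINED");
        conf.set("tez.runtime.pipelined.sorter.sort.threads", "2");
      }
    }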
----- TEZ-0.6.2
10GB_per_node
id  time  num_containers  mem         core    diag
0   212   239             144695261   21873
1   204   239             139582665   20945
2   211   239             143477178   21700

20GB_per_node
id  time  num_containers  mem         core    diag
0   392   239             272528515   42367
1   402   239             273085026   42469
2   410   239             270118502   42111

40GB_per_node
id  time  num_containers  mem         core    diag
0   761   239             525320249   82608
1   767   239             527612323   83271
2   736   239             520229980   82317

80GB_per_node
id  time  num_containers  mem         core    diag
0   1564  239             1123903845  173915
1   1666  239             1161079968  178656
2   1628  239             1146656912  175998

160GB_per_node
id  time  num_containers  mem         core    diag
0   3689  239             2523160230  377563
1   3796  240             2610411363  388928
2   3624  239             2546652697  381400
----- TEZ-0.7.0
10GB_per_node
id  time  num_containers  mem         core    diag
0   262   239             179373935   26223
1   259   239             179375665   25767
2   271   239             186946086   26516

20GB_per_node
id  time  num_containers  mem         core    diag
0   572   239             380034060   55515
1   533   239             364082337   53555
2   515   239             356570788   52762

40GB_per_node
id  time  num_containers  mem         core    diag
0   1405  339             953706595   136624  Vertex re-running
1   1157  239             828765079   118293
2   1219  239             833052604   118151

80GB_per_node
id  time  num_containers  mem         core    diag
0   3046  361             1999047193  279635  Vertex re-running
1   2967  337             2079807505  290171  Vertex re-running
2   3138  355             2030176406  282875  Vertex re-running

160GB_per_node
id  time  num_containers  mem         core    diag
0   6832  436             4524472859  634518  Vertex re-running
1   6233  365             4123693672  573259  Vertex re-running
2   6133  379             4121812899  579044  Vertex re-running