From the vertex re-running messages, it looks like you may be seeing errors 
while fetching shuffle data from mappers to reducers. This causes the map 
vertex tasks to be re-run to regenerate only the missing data, which is what 
the vertex re-running message reports.
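
If it helps to confirm this, the failed fetches should be visible in the task 
logs. One quick way to look (the application id is a placeholder, and the grep 
pattern is only a guess at the relevant messages):

    yarn logs -applicationId <application_id> | grep -i fetch

If the failures cluster on particular nodes, it may also be worth checking the 
disks and NodeManagers on those nodes.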

 

Bikas

 

From: Rajesh Balamohan [mailto:[email protected]] 
Sent: Saturday, June 11, 2016 5:14 AM
To: [email protected]
Subject: Re: Question on Tez 0.6 and Tez 0.7

 

Can you share the exceptions that happened in the Tez 0.7 runs?

-Rajesh.B

On 11-Jun-2016 16:28, "Sungwoo Park" <[email protected]> wrote:

Hello,

 

I have a question about the performance difference between Tez 0.6.2 and Tez 
0.7.0.

 

This is what we did:

 

1. Installed HDP 2.4 on an 11-node cluster (10 data nodes). No changes were 
made to the default settings recommended by HDP 2.4.

 

2. Ran TeraSort using Tez 0.6.2 and Tez 0.7.0, and compared the running time.
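
For reference, the runs were of this general shape; the jar path, row count, 
and HDFS paths below are illustrative rather than the exact commands used 
(10^9 100-byte rows gives the 100GB total for the 10GB_per_node case):

    hadoop jar hadoop-mapreduce-examples.jar teragen 1000000000 /benchmarks/tera-in
    hadoop jar hadoop-mapreduce-examples.jar terasort /benchmarks/tera-in /benchmarks/tera-out

with mapreduce.framework.name set to yarn-tez so that the job runs on Tez.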

 

Each experiment specifies the amount of input data per data node. For example, 
10GB_per_node means 100GB of total input because there are 10 data nodes in 
the cluster.

 

We've found that Tez 0.7.0 runs consistently slower than Tez 0.6.2, producing 
'Vertex re-running' errors quite often when the input data per node exceeds 
40GB. Even when there is no 'Vertex re-running', Tez 0.7.0 took much longer 
than Tez 0.6.2.

 

We know that Tez 0.7.0 is not inherently slower than Tez 0.6.2, because on a 
cluster of 44 nodes (with only 24GB of memory per node), Tez 0.7.0 finished 
TeraSort almost as fast as Tez 0.6.2. We are trying to figure out what we 
missed in the experiments on the 11-node cluster.

 

Any help here would be appreciated. Thanks a lot.

 

Sungwoo Park

 

----- Configuration

 

HDP 2.4

11 nodes, 10 data nodes, each with 96GB memory, 6 x 500GB HDDs

Same HDFS, YARN, and MapReduce setup for both Tez versions.

 

Each mapper container uses 5GB.

Each reducer container uses 10GB.
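
Assuming the container sizes were set through the usual MapReduce memory 
properties (the exact keys are not listed here), this corresponds to:

    mapreduce.map.memory.mb = 5120
    mapreduce.reduce.memory.mb = 10240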

 

Configurations specific to tez-0.6.2

tez.runtime.sort.threads = 2

 

Configurations specific to tez-0.7.0

tez.grouping.max-size = 1073741824

tez.runtime.sorter.class = PIPELINED

tez.runtime.pipelined.sorter.sort.threads = 2
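
For anyone reproducing this, the same three settings in the standard 
tez-site.xml form (1073741824 bytes = 1 GiB):

    <property>
      <name>tez.grouping.max-size</name>
      <value>1073741824</value>
    </property>
    <property>
      <name>tez.runtime.sorter.class</name>
      <value>PIPELINED</value>
    </property>
    <property>
      <name>tez.runtime.pipelined.sorter.sort.threads</name>
      <value>2</value>
    </property>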

 

----- TEZ-0.6.2

 

10GB_per_node

id      time    num_containers  mem             core    diag
0       212     239             144695261       21873
1       204     239             139582665       20945
2       211     239             143477178       21700

 

20GB_per_node

id      time    num_containers  mem             core    diag
0       392     239             272528515       42367
1       402     239             273085026       42469
2       410     239             270118502       42111

 

40GB_per_node

id      time    num_containers  mem             core    diag
0       761     239             525320249       82608
1       767     239             527612323       83271
2       736     239             520229980       82317

 

80GB_per_node

id      time    num_containers  mem             core    diag
0       1564    239             1123903845      173915
1       1666    239             1161079968      178656
2       1628    239             1146656912      175998

 

160GB_per_node

id      time    num_containers  mem             core    diag
0       3689    239             2523160230      377563
1       3796    240             2610411363      388928
2       3624    239             2546652697      381400

 

----- TEZ-0.7.0

 

10GB_per_node

id      time    num_containers  mem             core    diag
0       262     239             179373935       26223
1       259     239             179375665       25767
2       271     239             186946086       26516

 

20GB_per_node

id      time    num_containers  mem             core    diag
0       572     239             380034060       55515
1       533     239             364082337       53555
2       515     239             356570788       52762

 

40GB_per_node

id      time    num_containers  mem             core    diag
0       1405    339             953706595       136624  Vertex re-running
1       1157    239             828765079       118293
2       1219    239             833052604       118151

 

80GB_per_node

id      time    num_containers  mem             core    diag
0       3046    361             1999047193      279635  Vertex re-running
1       2967    337             2079807505      290171  Vertex re-running
2       3138    355             2030176406      282875  Vertex re-running

 

160GB_per_node

id      time    num_containers  mem             core    diag
0       6832    436             4524472859      634518  Vertex re-running
1       6233    365             4123693672      573259  Vertex re-running
2       6133    379             4121812899      579044  Vertex re-running
