Are the runtime configs identical between 0.6 and 0.7? E.g., do you see any
change in tez.runtime.io.sort.mb between 0.6 and 0.7?
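
If it helps, one quick way to answer that is to diff the two effective
tez-site.xml files. Below is a minimal sketch (the file paths are placeholders
for wherever the 0.6.2 and 0.7.0 configs live on your gateway node); it loads
each file into a Hadoop Configuration and prints every key whose value differs:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class TezConfDiff {
  public static void main(String[] args) {
    // Placeholder locations -- point these at the tez-site.xml of each install.
    Configuration c6 = new Configuration(false);
    c6.addResource(new Path("file:///etc/tez-0.6.2/conf/tez-site.xml"));
    Configuration c7 = new Configuration(false);
    c7.addResource(new Path("file:///etc/tez-0.7.0/conf/tez-site.xml"));

    // Collect the union of keys, then print any key whose value differs.
    Set<String> keys = new HashSet<String>();
    for (Map.Entry<String, String> e : c6) keys.add(e.getKey());
    for (Map.Entry<String, String> e : c7) keys.add(e.getKey());
    for (String key : keys) {
      String v6 = c6.get(key);
      String v7 = c7.get(key);
      if (v6 == null ? v7 != null : !v6.equals(v7)) {
        System.out.println(key + " : 0.6.2=" + v6 + " , 0.7.0=" + v7);
      }
    }
  }
}

Anything like tez.runtime.io.sort.mb, the sorter class, or the shuffle settings
showing up in that diff would be the first thing to look at.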

On Mon, Jun 13, 2016 at 9:20 AM, Sungwoo Park <[email protected]> wrote:

> Yes, we also think that that's what is happening. When there is 'Vertex
> re-running', we see Exceptions like:
>
> 2016-05-13 09:45:28,280 WARN [ResponseProcessor for block BP-358076261-10.1.91.4-1462195174702:blk_1075135562_1396252] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-358076261-10.1.91.4-1462195174702:blk_1075135562_1396252
> java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.1.91.11:37025 remote=/10.1.91.11:50010]
>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>   at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2201)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
> 2016-05-13 09:45:28,283 INFO [TezChild] task.TezTaskRunner: Encounted an error while executing task: attempt_1462251281999_0859_1_01_000063_0
> java.io.IOException: All datanodes 10.1.91.11:50010 are bad. Aborting...
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)
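>
> One thing we plan to try (this is only a guess on our side, not something we
> have verified): the 65000 ms value above looks like the DFS client socket
> timeout plus a small extension, so temporarily raising the client-side HDFS
> timeouts should tell us whether slow pipeline acks are really the trigger.
> A minimal sketch, with arbitrary values chosen only for the test:
>
> import org.apache.hadoop.conf.Configuration;
>
> public class RelaxHdfsClientTimeouts {
>   public static void main(String[] args) {
>     // Both keys exist in hdfs-default.xml; the values here are only illustrative.
>     Configuration conf = new Configuration();
>     conf.setInt("dfs.client.socket-timeout", 180000);          // default 60000 ms
>     conf.setInt("dfs.datanode.socket.write.timeout", 960000);  // default 480000 ms
>     // In practice we would put the same overrides into hdfs-site.xml on the clients.
>   }
> }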
>
> However, this still does not explain why Tez 0.7 runs much slower than
> Tez 0.6.2 even when no such Exception occurs. As an example, see the
> summary of results below. Tez 0.7 takes more than 1100 seconds to sort
> 40GB per node (with no Vertex re-running and no Exception other than at
> the end of the run), whereas Tez 0.6.2 takes just under 800 seconds.
> These results were obtained from the same cluster, using the default
> settings from HDP 2.4 (installed with Ambari), with very few differences
> in configuration settings between the two versions.
>
> I suspect we are missing some important settings for Tez 0.7 (since we did
> not install Hadoop and HDFS manually). Let me try the same experiment on
> other clusters and see what happens.
>
> Thanks for the response,
>
> --- Sungwoo
>
> tez-0.6.2-40GB_10nodes
> id    time(s)  num_containers  mem          core    diag
> 0     761      239             525320249    82608
> 1     767      239             527612323    83271
> 2     736      239             520229980    82317
>
> tez-0.7.0-40GB_10nodes
> id    time(s)  num_containers  mem          core    diag
> 0     1405     339             953706595    136624  Vertex re-running, vertexName=initialmap, vertexId=vertex_1462251281999_0859_1_00
> 1     1157     239             828765079    118293
> 2     1219     239             833052604    118151
>
>
>
> On Mon, Jun 13, 2016 at 6:21 AM, Bikas Saha <[email protected]> wrote:
>
>> From the vertex re-running messages, it seems like you may be seeing errors
>> fetching shuffle data from mappers to reducers. This results in re-running
>> map vertex tasks to produce only the missing data, with the corresponding
>> vertex re-running message.
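>>
>> If fetch failures are indeed the cause, one experiment is to give the
>> shuffle fetchers more slack before a source task is declared bad. A rough
>> sketch is below; please double-check the key names against
>> TezRuntimeConfiguration (or tez-default-template.xml) for your exact Tez
>> version, and treat the values as placeholders:
>>
>> import org.apache.hadoop.conf.Configuration;
>>
>> public class RelaxShuffleFetch {
>>   public static void main(String[] args) {
>>     // These settings would normally live in tez-site.xml rather than a program.
>>     Configuration conf = new Configuration(false);
>>     // Connect/read timeouts used by the shuffle fetchers (placeholder values).
>>     conf.setInt("tez.runtime.shuffle.connect.timeout", 300000);
>>     conf.setInt("tez.runtime.shuffle.read.timeout", 300000);
>>     // Failed fetch attempts tolerated before an input is reported as bad.
>>     conf.setInt("tez.runtime.shuffle.fetch.failures.limit", 10);
>>   }
>> }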
>>
>>
>>
>> Bikas
>>
>>
>>
>> *From:* Rajesh Balamohan [mailto:[email protected]]
>> *Sent:* Saturday, June 11, 2016 5:14 AM
>> *To:* [email protected]
>> *Subject:* Re: Question on Tez 0.6 and Tez 0.7
>>
>>
>>
>> Can you share the exceptions that happened in the Tez 0.7 runs?
>>
>> -Rajesh.B
>>
>> On 11-Jun-2016 16:28, "Sungwoo Park" <[email protected]> wrote:
>>
>> Hello,
>>
>>
>>
>> I have a question about the performance difference between Tez 0.6.2 and
>> Tez 0.7.0.
>>
>>
>>
>> This is what we did:
>>
>>
>>
>> 1. Installed HDP 2.4 on a 10-node cluster with default settings; no other
>> particular changes were made to the defaults recommended by HDP 2.4.
>>
>>
>>
>> 2. Ran TeraSort using Tez 0.6.2 and Tez 0.7.0, and compared the running
>> times (a rough sketch of the invocation is shown below).
>>
>>
>>
>> Each experiment specifies the amount of input data per node. For example,
>> 10GB_per_node means a total of 100GB of input, because there are 10 data
>> nodes in the cluster.
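>>
>> A rough sketch of how such a run can be driven programmatically is shown
>> below. The paths and row count are placeholders; the relevant part is
>> setting mapreduce.framework.name to yarn-tez so that the stock MapReduce
>> TeraGen/TeraSort examples execute on Tez:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.examples.terasort.TeraGen;
>> import org.apache.hadoop.examples.terasort.TeraSort;
>> import org.apache.hadoop.util.ToolRunner;
>>
>> public class RunTeraSortOnTez {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     // Route the MapReduce examples through Tez instead of classic MR.
>>     conf.set("mapreduce.framework.name", "yarn-tez");
>>     // 40GB per node x 10 data nodes = 400GB; TeraGen rows are 100 bytes each.
>>     long rows = 400L * 1024 * 1024 * 1024 / 100;
>>     ToolRunner.run(conf, new TeraGen(), new String[] {Long.toString(rows), "/tmp/tera/in"});
>>     ToolRunner.run(conf, new TeraSort(), new String[] {"/tmp/tera/in", "/tmp/tera/out"});
>>   }
>> }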
>>
>>
>>
>> We've found that Tez 0.7.0 runs consistently slower than Tez 0.6.2, and it
>> quite often produces 'Vertex re-running' errors once the input data per
>> node reaches 40GB. Even when there is no 'Vertex re-running', Tez 0.7.0
>> takes much longer than Tez 0.6.2.
>>
>>
>>
>> We know that Tez 0.7.0 is not inherently slower than Tez 0.6.2, because on
>> a cluster of 44 nodes (with only 24GB of memory per node), Tez 0.7.0
>> finished TeraSort almost as fast as Tez 0.6.2. We are trying to figure out
>> what we missed in the experiments on the 11-node cluster.
>>
>>
>>
>> Any help here would be appreciated. Thanks a lot.
>>
>>
>>
>> Sungwoo Park
>>
>>
>>
>> ----- Configuration
>>
>>
>>
>> HDP 2.4
>>
>> 11 nodes, 10 data nodes, each with 96GB memory, 6 x 500GB HDDs
>>
>> Same HDFS, YARN, and MR installation for both Tez versions.
>>
>>
>>
>> Each mapper container uses 5GB.
>>
>> Each reducer container uses 10GB.
>>
>>
>>
>> Configurations specific to tez-0.6.2
>>
>> tez.runtime.sort.threads = 2
>>
>>
>>
>> Configurations specific to tez-0.7.0
>>
>> tez.grouping.max-size = 1073741824
>>
>> tez.runtime.sorter.class = PIPELINED
>>
>> tez.runtime.pipelined.sorter.sort.threads = 2
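>>
>> (These are plain tez-site.xml / mapred-site.xml entries. Setting them
>> programmatically would look roughly like the sketch below; the three tez.*
>> keys are copied from the list above, while the container-size keys are the
>> usual MapReduce ones and are an assumption about how the 5GB/10GB container
>> sizes are expressed in our Ambari-managed configs.)
>>
>> import org.apache.tez.dag.api.TezConfiguration;
>>
>> public class ApplyJobSettings {
>>   public static void main(String[] args) {
>>     // TezConfiguration picks up tez-site.xml; these overrides mirror the lists above.
>>     TezConfiguration conf = new TezConfiguration();
>>     conf.setLong("tez.grouping.max-size", 1073741824L);          // 1 GB grouping max size
>>     conf.set("tez.runtime.sorter.class", "PIPELINED");           // pipelined sorter (0.7.0)
>>     conf.setInt("tez.runtime.pipelined.sorter.sort.threads", 2); // two sort threads
>>     // Assumed container sizing for the MR TeraSort job (5 GB mappers, 10 GB reducers).
>>     conf.setInt("mapreduce.map.memory.mb", 5120);
>>     conf.setInt("mapreduce.reduce.memory.mb", 10240);
>>   }
>> }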
>>
>>
>>
>> ----- TEZ-0.6.2
>>
>>
>>
>> 10GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     212      239             144695261    21873
>> 1     204      239             139582665    20945
>> 2     211      239             143477178    21700
>>
>> 20GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     392      239             272528515    42367
>> 1     402      239             273085026    42469
>> 2     410      239             270118502    42111
>>
>> 40GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     761      239             525320249    82608
>> 1     767      239             527612323    83271
>> 2     736      239             520229980    82317
>>
>> 80GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     1564     239             1123903845   173915
>> 1     1666     239             1161079968   178656
>> 2     1628     239             1146656912   175998
>>
>> 160GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     3689     239             2523160230   377563
>> 1     3796     240             2610411363   388928
>> 2     3624     239             2546652697   381400
>>
>>
>>
>> ----- TEZ-0.7.0
>>
>>
>>
>> 10GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     262      239             179373935    26223
>> 1     259      239             179375665    25767
>> 2     271      239             186946086    26516
>>
>> 20GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     572      239             380034060    55515
>> 1     533      239             364082337    53555
>> 2     515      239             356570788    52762
>>
>> 40GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     1405     339             953706595    136624  Vertex re-running
>> 1     1157     239             828765079    118293
>> 2     1219     239             833052604    118151
>>
>> 80GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     3046     361             1999047193   279635  Vertex re-running
>> 1     2967     337             2079807505   290171  Vertex re-running
>> 2     3138     355             2030176406   282875  Vertex re-running
>>
>> 160GB_per_node
>> id    time(s)  num_containers  mem          core    diag
>> 0     6832     436             4524472859   634518  Vertex re-running
>> 1     6233     365             4123693672   573259  Vertex re-running
>> 2     6133     379             4121812899   579044  Vertex re-running
>>
>>
>


-- 
~Rajesh.B
