[ https://issues.apache.org/jira/browse/TEZ-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340796#comment-14340796 ]

Bikas Saha commented on TEZ-2148:
---------------------------------

[~oae] Do you have access to tez-site.xml? Since this happens on busy 
clusters, I am guessing this is a result of contention between multiple 
sessions, such that newer sessions starve while older sessions have not yet 
released idle resources. 
Do you observe that the first Tez flow is comparable to or faster than the 
first MR flow, and that subsequent Tez flows start slowing down compared to MR? 
Do multiple sets of 4 DAGs go to the same Tez session, or do you close the 
session after the 4 DAGs have completed and start a new session for the next 
query?
What are the values of tez.am.container.idle.release-timeout-min.millis, 
tez.am.container.idle.release-timeout-max.millis, and 
tez.am.session.min.held-containers? Their defaults are 5s, 10s, and 0, 
which means that idle resources are released between 5 and 10 seconds after 
becoming idle, and all idle containers are eventually released.

One way to test this theory would be to set the min/max timeouts to low values 
like 100 ms or 250 ms and try it out. This should release containers to other 
sessions quickly.
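A minimal tez-site.xml fragment for the quick-release experiment described above might look like the following; the values are illustrative test settings, not recommendations:

```xml
<!-- Hypothetical tez-site.xml fragment: release idle containers quickly
     so other sessions can acquire them. Values are for testing only. -->
<property>
  <name>tez.am.container.idle.release-timeout-min.millis</name>
  <value>100</value>
</property>
<property>
  <name>tez.am.container.idle.release-timeout-max.millis</name>
  <value>250</value>
</property>
```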
If you are submitting multiple queries to a pool of running sessions (where 
each query can be multiple DAGs), then what you could do is set low values for 
the min/max timeouts so that they serve your need for within-DAG reuse, and set 
tez.am.session.min.held-containers so that your sessions hold enough containers 
to get good initial latency while new containers are being acquired to complete 
the remainder of the DAG. Tez tries to spread the min held containers evenly 
across the machines in the cluster for good initial locality, so having 1 per 
machine (if possible) would be good.
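Combining the two suggestions above, a pooled-session setup could be sketched as follows; this assumes a hypothetical 10-node cluster (so roughly 1 held container per machine), and the timeout values are illustrative:

```xml
<!-- Sketch of the pooled-session tuning described above. Assumes a
     10-node cluster; adjust min.held-containers to your machine count. -->
<property>
  <name>tez.am.container.idle.release-timeout-min.millis</name>
  <value>100</value>
</property>
<property>
  <name>tez.am.container.idle.release-timeout-max.millis</name>
  <value>250</value>
</property>
<property>
  <name>tez.am.session.min.held-containers</name>
  <value>10</value>
</property>
```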

> Slow container grabbing with Capacity Scheduler in comparison to MapReduce
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-2148
>                 URL: https://issues.apache.org/jira/browse/TEZ-2148
>             Project: Apache Tez
>          Issue Type: Task
>    Affects Versions: 0.5.1
>            Reporter: Johannes Zillmann
>         Attachments: TEZ-2148.svg, applicationLogs.zip, 
> capacity-scheduler.xml, client-mapreduce.log, client-tez.log, dag1.pdf, 
> dag2.pdf, dag3.pdf, dag4.pdf
>
>
> A customer experienced the following:
> - Setup a CapacityScheduler for user 'company'
> - The same processing job on the same data is faster with MapReduce than 
> with Tez under "normal" cluster load. Only when nothing else runs on Hadoop 
> does Tez outperform MapReduce. (It's hard to give exact data here since we 
> get all information second-hand from the customer, but the timings were 
> pretty stable over a dozen runs: the MapReduce job took about 70 sec and Tez 
> about 170 sec.)
> So the question is: is there some difference in how Tez grabs resources 
> from the capacity scheduler compared to MapReduce?
> Looking at the logs, it appears that Tez is always very slow in starting 
> containers, whereas MapReduce parallelizes very quickly.
> Attached client and application logs for Tez and MapReduce run as well as the 
> scheduler configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
