[ https://issues.apache.org/jira/browse/TEZ-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340796#comment-14340796 ]
Bikas Saha commented on TEZ-2148:
---------------------------------

[~oae] Do you have access to the tez-site.xml? Since this happens on busy clusters, I am guessing this is a result of contention between multiple sessions, such that newer sessions starve while older sessions have not yet released idle resources.

Do you observe that the first Tez flow is comparable to or faster than the first MR flow, and that subsequent Tez flows start slowing down compared to MR? Do multiple sets of 4 DAGs go to the same Tez session, or do you close the session after the 4 DAGs have completed and start a new session for the next query?

What are the values of tez.am.container.idle.release-timeout-min.millis, tez.am.container.idle.release-timeout-max.millis and tez.am.session.min.held-containers? Their default values are 5s, 10s and 0, which means all idle containers will be released between 5 and 10 seconds after becoming idle. One way to test this theory would be to turn the min/max timeouts down to low values like 100 ms or 250 ms and try it out. This should release containers to other sessions quickly.

If you are submitting multiple queries to a pool of running sessions (where each query can be multiple DAGs), then what you could do is set low values for the min/max timeouts so that they still serve your need for within-DAG reuse, and set tez.am.session.min.held-containers so that your sessions hold enough containers to get good initial latency while new containers are being acquired to complete the remainder of the DAG. Tez tries to spread the min held containers evenly across the machines in the cluster for good initial locality, so having 1 per machine (if possible) would be good.
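As a sketch, the tuning described above could look like the following in tez-site.xml. The 100/250 ms timeouts are the test values suggested in the comment; the min-held-containers value of 8 is purely illustrative (it assumes roughly an 8-node cluster, per the 1-per-machine suggestion) and should be sized to your own cluster:

```xml
<!-- Illustrative values only: release idle containers quickly so that
     other sessions on a busy queue are not starved. -->
<property>
  <name>tez.am.container.idle.release-timeout-min.millis</name>
  <value>100</value>
</property>
<property>
  <name>tez.am.container.idle.release-timeout-max.millis</name>
  <value>250</value>
</property>

<!-- Keep a small floor of held containers per session for good initial
     latency; ~1 per node is a reasonable starting point (8 assumes an
     8-node cluster and is hypothetical). -->
<property>
  <name>tez.am.session.min.held-containers</name>
  <value>8</value>
</property>
```

With these settings, idle containers flow back to other sessions within a fraction of a second, while each session keeps a minimum working set so a newly submitted DAG does not start cold.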
> Slow container grabbing with Capacity Scheduler in comparison to MapReduce
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-2148
>                 URL: https://issues.apache.org/jira/browse/TEZ-2148
>             Project: Apache Tez
>          Issue Type: Task
>    Affects Versions: 0.5.1
>            Reporter: Johannes Zillmann
>         Attachments: TEZ-2148.svg, applicationLogs.zip, capacity-scheduler.xml, client-mapreduce.log, client-tez.log, dag1.pdf, dag2.pdf, dag3.pdf, dag4.pdf
>
> A customer experienced the following:
> - Set up a CapacityScheduler for user 'company'
> - The same processing job on the same data is faster with MapReduce than with Tez under "normal" cluster load. Only when nothing else runs on Hadoop does Tez outperform MapReduce. (It's hard to give exact data here since we get all information second hand from the customer, but the timings were pretty stable over a dozen runs: the MapReduce job took about 70 sec and Tez about 170 sec.)
> So the question is: is there some difference in how Tez grabs resources from the capacity scheduler compared to MapReduce?
> Looking at the logs, Tez appears to be consistently slow in starting containers, whereas MapReduce parallelizes very quickly.
> Attached are client and application logs for the Tez and MapReduce runs as well as the scheduler configuration.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)