Hey Hitesh,

so I created https://issues.apache.org/jira/browse/TEZ-2148. Note that everything is quite vague, since all of our information came second-hand from a customer, so I can't answer most of your questions below. However, I attached application and client logs from a typical “slow” Tez run and a typical “fast” MapReduce run. Maybe you guys can spot something in there more easily!
The Tez version, btw, is 0.5.1 (not 0.5.2 as stated before)! Any insights appreciated!

Johannes

> On 23 Feb 2015, at 18:59, Hitesh Shah <[email protected]> wrote:
>
> Thanks for the feedback, Johannes.
>
> Would it be possible for you to file a jira for the performance issue that you are seeing, with logs? Please strip out necessary data to hide the customer info, etc. The logs that would be most useful are:
> - comparison logs of an MR job vs a Tez job showing the container launch slowness/delay
> - Tez logs from a job submitted to a busy cluster vs a free cluster
>
> To confirm, you are using 0.5.2?
>
> Also, some questions on the env/job runs if you can help answer them (before I jump to any possible conclusions :) ):
> - Was the performance difference seen in the case where you were running multiple jobs concurrently, submitted to the same queue? (Given that all jobs are submitted to the same queue, the question on capacity scheduler preemption is moot.)
> - Are all jobs running as the same user (from a YARN perspective, i.e. ignoring impersonation)?
> - When you mention that Tez was slow at launching containers, do you know whether the queue had sufficient resources to launch the new containers at the time this was observed?
> - If the first question refers to a concurrency test, were containers being held up by other idle AMs and therefore starving out the AMs that were doing useful work?
> - What was Tez configured to in terms of how long it should hold on to containers and how many of them (the container.idle.release-timeout* and session.min.held-containers properties)?
>
> thanks
> — Hitesh
>
> On Feb 23, 2015, at 8:52 AM, Johannes Zillmann <[email protected]> wrote:
>
>> Ok, it turned out that we calculated resources for MapReduce and Tez differently and thus over-combined splits in Tez, which led to a reduced split count!
>> However, MapReduce still outperformed Tez in a lot of runs. After multiple iterations on the issue (a deployment at a customer to which we have limited access), things look like this:
>>
>> - the customer has the capacity scheduler configured (configuration attached; our product uses the productA queue)
>> - if the cluster is completely free, Tez outperforms MapReduce
>> - when the cluster is in use, MapReduce seems to always outperform Tez
>>
>> So the question is: is there some difference in how Tez grabs resources from the capacity scheduler compared to MapReduce?
>> Looking at the logs, Tez is always very slow to start its containers, whereas MapReduce parallelizes very quickly.
>>
>> Any thoughts on that?
>>
>> Johannes
>>
>> <capacity-scheduler.xml>
>>> On 09 Feb 2015, at 19:24, Siddharth Seth <[email protected]> wrote:
>>>
>>> Johannes,
>>> How many tasks end up running for this specific vertex? Is it more than a single wave of tasks (i.e. more than the number of containers available on the cluster)?
>>> Tez allocates already-running containers depending on configuration. Tuning these may help:
>>> tez.am.container.reuse.locality.delay-allocation-millis - increase this to a higher value for re-use to be less aggressive (default is 250 ms)
>>> tez.am.container.reuse.rack-fallback.enabled - enable/disable rack-fallback re-use
>>> tez.am.container.reuse.non-local-fallback.enabled - enable/disable non-local re-use
>>>
>>> You could try disabling container re-use completely to see if the situation improves.
>>> Also - how many tasks are generated for MapReduce vs Tez?
>>>
>>> Thanks
>>> - Sid
>>>
>>> On Mon, Feb 9, 2015 at 8:18 AM, Johannes Zillmann <[email protected]> wrote:
>>> Hey guys,
>>>
>>> I have a question about data locality in Tez.
>>> Same type of input and computation logic:
>>> MapReduce data locality: 95%
>>> Tez data locality: 50%
>>>
>>> I have a custom InputInitializer where I'm doing something like this:
>>>
>>> TezSplit[] splits = inputFormat.getSplits(conf, desiredSplits);
>>>
>>> List<Event> events = Lists.newArrayList();
>>> List<TaskLocationHint> locationHints = Lists.newArrayList();
>>> for (TezSplit split : splits) {
>>>   locationHints.add(TaskLocationHint.createTaskLocationHint(
>>>       Sets.newHashSet(split.getLocations()), null));
>>> }
>>> VertexLocationHint locationHint = VertexLocationHint.create(locationHints);
>>>
>>> InputConfigureVertexTasksEvent configureVertexEvent =
>>>     InputConfigureVertexTasksEvent.create(splits.length, locationHint,
>>>         InputSpecUpdate.getDefaultSinglePhysicalInputSpecUpdate());
>>> events.add(configureVertexEvent);
>>> for (TezSplit split : splits) {
>>>   events.add(InputDataInformationEvent.createWithSerializedPayload(
>>>       events.size() - 1, ByteBuffer.wrap(split.toByteArray())));
>>> }
>>>
>>> Any obvious flaw here?
>>> Or an explanation why data locality is worse?
>>>
>>> best
>>> Johannes
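
A note on the initializer snippet above: the source index passed to InputDataInformationEvent has to line up with the position of the corresponding TaskLocationHint in the VertexLocationHint list, since that pairing is what carries the locality information to the scheduler. Relying on events.size() - 1 happens to work only because the configure event is added to the list first; an explicit counter makes the pairing harder to break. A sketch, reusing the custom TezSplit wrapper from the mail above:

    int taskIndex = 0;
    for (TezSplit split : splits) {
        // The payload for task i must correspond to locationHints.get(i),
        // otherwise the AM schedules tasks against the wrong hosts.
        events.add(InputDataInformationEvent.createWithSerializedPayload(
                taskIndex++, ByteBuffer.wrap(split.toByteArray())));
    }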
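For the container re-use knobs Sid lists, a minimal sketch of setting them programmatically on the configuration used to build the DAG/session, assuming Tez 0.5.x key names (the values are illustrative, not recommendations):

    // org.apache.tez.dag.api.TezConfiguration
    TezConfiguration tezConf = new TezConfiguration();
    // Wait longer for a node-local container before re-using one elsewhere
    // (default is 250 ms).
    tezConf.setLong("tez.am.container.reuse.locality.delay-allocation-millis", 1000L);
    // Disable rack-local and non-local fallback so tasks hold out for
    // node-local containers.
    tezConf.setBoolean("tez.am.container.reuse.rack-fallback.enabled", false);
    tezConf.setBoolean("tez.am.container.reuse.non-local-fallback.enabled", false);
    // Or disable re-use entirely, to mimic MapReduce's container-per-task model.
    // tezConf.setBoolean("tez.am.container.reuse.enabled", false);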
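Likewise, the container-holding behaviour Hitesh asks about maps to the idle-release timeouts and the session's minimum held containers. To the best of my knowledge, the full key names behind container.idle.release-timeout* are the min/max pair below (again a sketch with illustrative values):

    // How long the AM keeps an idle container before releasing it to YARN;
    // the actual timeout is chosen between this min and max.
    tezConf.setLong("tez.am.container.idle.release-timeout-min.millis", 5000L);
    tezConf.setLong("tez.am.container.idle.release-timeout-max.millis", 10000L);
    // Containers a session AM holds on to even when no DAG is running.
    tezConf.setInt("tez.am.session.min.held-containers", 0);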
