Hi All,

I am new to Tez and currently learning it. It would be great if someone could clarify the issues below. I know I am asking a lot of questions at once, but they are all related.
We are facing resource utilization and latency issues with a few long-running queries on Tez, and while trying to tune and debug them we observed the issues below. May I know whether these are commonly faced, or whether there is a better way to tune/debug/monitor?

*Environment:* Tez 0.8.4 (on emr-5.16.0)

*Issues:*

1. When I submit a query, Tez uses all available resources on the cluster, which causes lots of problems in our multi-tenant environment. (Even though queues are configured, user queries within a given queue still contend with each other.) How can we cap the total resources per Tez application? We are trying tez.grouping.min-size/tez.grouping.max-size, Tez container sizes, etc., but still see resource utilization issues.

2. With MR/Spark, users can predict roughly how long a given query will take, whereas with Tez the number of tasks, resource utilization, and latency vary from run to run. Is this expected?

3. A few of the queries consumed huge resources, e.g. 2.3 TB of memory (roughly all the resources available on YARN), and still did not complete after 12 hours. The same query run through Spark completed in 2 hours with just 30% of the resources. The issue we observed here is Tez's grouping optimizations. Is there any way to restrict Tez to computing input splits the way MR/Spark do (based on the Hadoop InputFormats), instead of basing them on the total resources available in the cluster/queue?

4. When we try to monitor a Tez DAG execution to understand what is going on, task logs are not available at runtime. Until a vertex finishes executing, we cannot see its logs/counters, which is a bit problematic. In Spark, for example, we can see task logs at runtime, which is very useful.

5. Most of the time, when a Tez application fails with "DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices", the console logs do not make clear which vertex caused the overall DAG failure, and even when we find the vertex id, we are unable to locate that vertex in the Tez web UI. Has anyone else observed this?

6. Sometimes, even though a few vertices in a DAG have succeeded, when some other vertex later fails, all the previously SUCCEEDED vertices are also marked RED. May I know the reason for this? How is it useful?

7. Even though the Tez job fails, YARN shows its status as SUCCEEDED. What is the reason for this?

Note: I can attach the queries, but these observations are not specific to one query; they happen with several large tables whose input format is Text/RCFile, and we observe the same issues even with a few large Parquet-format tables.

Thanks & Regards,
B Anil Kumar.
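P.S. For context on issue 1, these are the grouping/sizing knobs we have been experimenting with (property names are from the Tez/Hive configuration docs; the values are illustrative, not a recommendation):

```properties
# Illustrative values only -- the knobs we have tried so far.
# Bound the size of a grouped split so a single task does not
# receive too little or too much input:
tez.grouping.min-size=16777216
tez.grouping.max-size=268435456
# Lower the wave factor so Tez schedules fewer concurrent tasks
# relative to the queue's available capacity (default 1.7):
tez.grouping.split-waves=1.0
# Container and AM sizing for Hive-on-Tez (MB):
hive.tez.container.size=4096
tez.am.resource.memory.mb=4096
```

Even with settings along these lines, a single application can still expand to fill whatever headroom the queue has, which is the behaviour we are trying to cap.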
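P.S. On the queue contention side, this is roughly how our capacity-scheduler limits look (the queue name "etl" and the percentages are placeholders for our actual setup):

```xml
<!-- capacity-scheduler.xml: illustrative limits for a queue named "etl" -->
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- Hard ceiling on elastic growth beyond the configured capacity -->
  <name>yarn.scheduler.capacity.root.etl.maximum-capacity</name>
  <value>40</value>
</property>
<property>
  <!-- Keep a single user from occupying the whole queue -->
  <name>yarn.scheduler.capacity.root.etl.user-limit-factor</name>
  <value>1</value>
</property>
```

Despite these limits, queries from users sharing the same queue still contend heavily with each other, which is the multi-tenancy problem described in issue 1.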
