Hi All,

I am new to Tez and currently learning it. It would be great if someone could clarify the issues below. I know I am asking a lot of questions at once, but they are all related.
We are facing resource utilization and latency issues with a few long-running queries on Tez, and while trying to tune and debug them we observed the issues below. May I know whether these are commonly faced, or whether there is a better way to tune/debug/monitor?

*Environment:* Tez 0.8.4 (on emr-5.16.0)

*Issues:*

1. When I submit a query, Tez uses all available resources on the cluster, which causes lots of problems in our multi-tenant environment. (Even though queues are configured, user queries within a given queue still contend with each other.) How can we cap the total resources per Tez application? We are trying tez.grouping.min-size/tez.grouping.max-size, Tez container sizes, etc., but still see resource utilization issues.

2. With MR/Spark, users can predict roughly how long a given query will take, whereas with Tez the number of tasks, resource utilization, and latency vary from run to run. Is this expected?

3. A few of the queries consumed huge resources, e.g. 2.3 TB of memory (roughly all the resources available on YARN), and still did not complete after 12 hours. The same query run through Spark completed in 2 hours with just 30% of the resources. The issue we observed here is Tez's grouping optimizations. Is there any way to restrict Tez to computing input splits the way MR/Spark do (based on the Hadoop InputFormats), instead of basing them on the total resources available in the cluster/queue?

4. When we try to monitor a Tez DAG execution to understand what is going on, task logs are not available at runtime. Until a vertex finishes executing, we cannot see its logs/counters, which is a bit problematic. In Spark, for example, we can see task logs at runtime, which is very useful.

5. Most of the time, when a Tez application fails with "DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices", the console logs do not make clear which vertex caused the overall DAG failure, and even when we find the vertex id, we are unable to locate that vertex in the Tez web UI. Has anyone else observed this?

6. Sometimes, even though a few vertices in a DAG have succeeded, when some other vertex later fails, all the previously SUCCEEDED vertices are also marked RED. May I know the reason for this? How is it useful?

7. Even though the Tez job fails, YARN shows its status as SUCCEEDED. What is the reason for this?

Note: I can attach the queries, but these observations are not specific to one query; they happen with several large tables whose input format is Text/RCFile, and we observe the same issues even with a few large Parquet-format tables.

Thanks & Regards,
B Anil Kumar.
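P.S. For context on issue 1, these are the grouping/sizing knobs we have been experimenting with (property names are from the Tez/Hive configuration docs; the values are illustrative, not a recommendation):

```properties
# Illustrative values only -- the knobs we have tried so far.
# Bound the size of a grouped split so a single task does not
# receive too little or too much input:
tez.grouping.min-size=16777216
tez.grouping.max-size=268435456
# Lower the wave factor so Tez schedules fewer concurrent tasks
# relative to the queue's available capacity (default 1.7):
tez.grouping.split-waves=1.0
# Container and AM sizing for Hive-on-Tez (MB):
hive.tez.container.size=4096
tez.am.resource.memory.mb=4096
```

Even with settings along these lines, a single application can still expand to fill whatever headroom the queue has, which is the behaviour we are trying to cap.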
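P.S. On the queue contention side, this is roughly how our capacity-scheduler limits look (the queue name "etl" and the percentages are placeholders for our actual setup):

```xml
<!-- capacity-scheduler.xml: illustrative limits for a queue named "etl" -->
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- Hard ceiling on elastic growth beyond the configured capacity -->
  <name>yarn.scheduler.capacity.root.etl.maximum-capacity</name>
  <value>40</value>
</property>
<property>
  <!-- Keep a single user from occupying the whole queue -->
  <name>yarn.scheduler.capacity.root.etl.user-limit-factor</name>
  <value>1</value>
</property>
```

Despite these limits, queries from users sharing the same queue still contend heavily with each other, which is the multi-tenancy problem described in issue 1.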
