1. Is it possible to determine from the tez history logs, what the bottleneck for a task/vertex is? Whether it is compute, disk or network?
- Vertex counters and task counters for the vertex can be looked at to determine this. If you have enabled ATS, they are available in the Tez UI itself; otherwise they are available in the job logs. However, the slowness is not always directly related to compute/disk/network. Sometimes the vertex is delayed because it has to wait for data from a source vertex (think of it more as a data dependency), sometimes due to re-execution of tasks in the source vertex after failures (e.g. bad disks), or due to cluster slot unavailability, and so on. You can also look at the CriticalPathAnalyzer (an early version is available in 0.8.x), which helps determine the critical path of the DAG and hence whether a vertex was slow due to one of these conditions. E.g.

HADOOP_CLASSPATH=$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH yarn jar $TEZ_HOME/tez-job-analyzer-0.8.2-SNAPSHOT.jar CriticalPath --outputDir=/tmp/ --dagId=dag_1443665985063_58064_1

2. What are the common ways to get Tez work on data in memory, as opposed to reading from HDFS. This is to minimize the duration mappers spend in reading from HDFS or disk.

- Not sure if you are trying to compare this with Spark's way of loading data into memory and working on it. Tez does not have a direct equivalent for that, but it does have an ObjectRegistry (look at BroadcastAndOneToOneExample <https://github.com/apache/tez/blob/b153035b076d4603eb6bc771d675d64181eb02e9/tez-tests/src/main/java/org/apache/tez/mapreduce/examples/BroadcastAndOneToOneExample.java> in the Tez codebase) where data can be cached in memory and shared between tasks; a rough sketch is included at the end of this mail.

~Rajesh.B

On Tue, Dec 1, 2015 at 12:33 AM, Raajay <[email protected]> wrote:
> Hello,
>
> Two questions
>
> 1. Is it possible to determine from the tez history logs, what the
> bottleneck for a task/vertex is? Whether it is compute, disk or network?
>
> 2. What are the common ways to get Tez work on data in memory, as opposed
> to reading from HDFS. This is to minimize the duration mappers spend in
> reading from HDFS or disk.
>
> Thanks
> Raajay
>
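P.S. Here is a minimal sketch of how a custom processor could use the ObjectRegistry. It assumes a DAG-scoped cache key ("sharedLookupTable") and a hypothetical buildLookupTable() helper that loads the data once; later tasks of the same DAG that land on the same container can then pick the object up from memory instead of re-reading it from HDFS. Take it as an illustration, not as the exact pattern used in BroadcastAndOneToOneExample.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.tez.runtime.api.AbstractLogicalIOProcessor;
import org.apache.tez.runtime.api.Event;
import org.apache.tez.runtime.api.LogicalInput;
import org.apache.tez.runtime.api.LogicalOutput;
import org.apache.tez.runtime.api.ObjectRegistry;
import org.apache.tez.runtime.api.ProcessorContext;

public class CachingProcessor extends AbstractLogicalIOProcessor {

  public CachingProcessor(ProcessorContext context) {
    super(context);
  }

  @Override
  public void initialize() throws Exception {
    // nothing to set up in this sketch
  }

  @Override
  @SuppressWarnings("unchecked")
  public void run(Map<String, LogicalInput> inputs,
      Map<String, LogicalOutput> outputs) throws Exception {
    ObjectRegistry registry = getContext().getObjectRegistry();

    // Check whether an earlier task of this DAG (in the same container)
    // already loaded the data; "sharedLookupTable" is just an example key.
    Map<String, String> table =
        (Map<String, String>) registry.get("sharedLookupTable");
    if (table == null) {
      table = buildLookupTable(); // hypothetical loader, e.g. reads a small file once
      // Keep it in memory for the lifetime of the DAG so later tasks scheduled
      // on this container can reuse it without going back to HDFS.
      registry.cacheForDAG("sharedLookupTable", table);
    }

    // ... process inputs/outputs using 'table' ...
  }

  @Override
  public void handleEvents(List<Event> processorEvents) {
    // no custom events in this sketch
  }

  @Override
  public void close() throws Exception {
    // nothing to clean up
  }

  private Map<String, String> buildLookupTable() {
    return new HashMap<String, String>(); // placeholder for the real load
  }
}

Note that the registry is scoped to the container JVM, so it only helps when tasks are scheduled on the same (reused) container; it is not a cluster-wide in-memory store.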
