In the HDFS in-memory tier, caching is best effort: data is written to RAM and asynchronously persisted to disk. This keeps the data reliably available despite memory pressure or a machine reboot, so the application remains functional.
The data continues to reside in memory until memory pressure forces it to be released, so the read path also benefits from the performance gains. There were some performance bottlenecks in the HDFS read path, and those were fixed as part of the in-memory tier changes.

Bikas

From: Raajay [mailto:[email protected]]
Sent: Monday, November 30, 2015 4:41 PM
To: [email protected]
Subject: Re: Running tez jobs with data in memory

Great, thanks! Am I right in inferring that the HDFS in-memory tier helps in speeding up writes and not reads? Reads might still happen from disk if there is no caching in RAM.

One of the alternatives I was exploring was running Tez atop Tachyon, but I have not been able to get that working so far :(

Raajay

On Nov 30, 2015, at 6:34 PM, Rajesh Balamohan <[email protected]> wrote:

Adding more to #2: alternatively, you may want to consider adding paths to the HDFS in-memory tier (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html) [see the LAZY_PERSIST sketch after the thread].

~Rajesh.B

On Tue, Dec 1, 2015 at 5:41 AM, Rajesh Balamohan <[email protected]> wrote:

1. Is it possible to determine from the tez history logs, what the bottleneck for a task/vertex is? Whether it is compute, disk or network?

- The vertex counters and task counters for the vertex can be examined to determine this. If you have enabled ATS, they are available in the Tez UI itself; otherwise they should be available in the job logs. However, slowness is not always directly attributable to compute/disk/network. Sometimes a vertex is delayed because it has to get data from a source vertex (think of it more as a data dependency), sometimes due to re-execution of tasks in the source vertex after failures such as bad disks, sometimes due to cluster slot unavailability, and so on. You can also look at the CriticalPathAnalyzer (an early version is available in 0.8.x), which can help determine the critical path of the DAG and whether a vertex was slow due to these different conditions. E.g.:

HADOOP_CLASSPATH=$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH yarn jar $TEZ_HOME/tez-job-analyzer-0.8.2-SNAPSHOT.jar CriticalPath --outputDir=/tmp/ --dagId=dag_1443665985063_58064_1

2. What are the common ways to get Tez to work on data in memory, as opposed to reading from HDFS? This is to minimize the duration mappers spend in reading from HDFS or disk.

- Not sure if you are trying to compare with the Spark way of loading data into memory and working on it. Tez does not have a direct equivalent for this, but Tez has an ObjectRegistry (look for BroadcastAndOneToOneExample <https://github.com/apache/tez/blob/b153035b076d4603eb6bc771d675d64181eb02e9/tez-tests/src/main/java/org/apache/tez/mapreduce/examples/BroadcastAndOneToOneExample.java> in the Tez codebase) where data can be stored in memory and shared between tasks [see the ObjectRegistry sketch after the thread].

~Rajesh.B

On Tue, Dec 1, 2015 at 12:33 AM, Raajay <[email protected]> wrote:

Hello,

Two questions:

1. Is it possible to determine from the tez history logs, what the bottleneck for a task/vertex is? Whether it is compute, disk or network?

2. What are the common ways to get Tez to work on data in memory, as opposed to reading from HDFS? This is to minimize the duration mappers spend in reading from HDFS or disk.

Thanks
Raajay

--
~Rajesh.B
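
Below is a minimal Java sketch of the LAZY_PERSIST suggestion: writing a file into the HDFS in-memory tier, assuming Hadoop 2.6+ and DataNodes configured with RAM_DISK storage (a [RAM_DISK] entry in dfs.datanode.data.dir). The path, buffer size, and payload are hypothetical. An existing directory can instead be tagged with the LAZY_PERSIST storage policy via the hdfs storagepolicies CLI.

    import java.util.EnumSet;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.CreateFlag;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class LazyPersistWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/lazy-persist-demo"); // hypothetical path

        // CreateFlag.LAZY_PERSIST asks HDFS to place a single replica in the
        // DataNode's RAM disk and persist it to disk lazily (best effort);
        // if memory is unavailable, the write falls back to disk.
        try (FSDataOutputStream out = fs.create(
            path,
            FsPermission.getFileDefault(),
            EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST),
            4096,                          // buffer size
            (short) 1,                     // lazy-persist files are single-replica
            fs.getDefaultBlockSize(path),
            null)) {                       // no progress callback
          out.write("hello from the in-memory tier".getBytes("UTF-8"));
        }
      }
    }

The single replica and the fallback behavior match the best-effort semantics Bikas describes above; the MemoryStorage page linked in the thread covers the DataNode configuration this sketch assumes.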

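And a minimal sketch of the ObjectRegistry approach: a Tez processor that builds an expensive object once and reuses it across tasks scheduled in the same container, assuming the 0.8.x runtime API. CachingProcessor, buildLookupTable, and the "lookup-table" key are hypothetical; note the registry lives in the container JVM, so sharing relies on container reuse.

    import java.util.HashMap;

    import org.apache.tez.runtime.api.ObjectRegistry;
    import org.apache.tez.runtime.api.ProcessorContext;
    import org.apache.tez.runtime.library.processor.SimpleProcessor;

    public class CachingProcessor extends SimpleProcessor {

      public CachingProcessor(ProcessorContext context) {
        super(context);
      }

      @Override
      public void run() throws Exception {
        ObjectRegistry registry = getContext().getObjectRegistry();

        // get() returns null on a miss; "lookup-table" is a hypothetical key.
        Object table = registry.get("lookup-table");
        if (table == null) {
          table = buildLookupTable();
          // cacheForDAG keeps the object for the lifetime of the DAG;
          // cacheForVertex and cacheForSession offer other lifetimes.
          registry.cacheForDAG("lookup-table", table);
        }
        // ... use `table` while processing this task's input ...
      }

      // Hypothetical expensive load we only want to pay for once per JVM.
      private Object buildLookupTable() {
        return new HashMap<String, String>();
      }
    }

For data that must reach every task regardless of container placement, a broadcast edge (as demonstrated in BroadcastAndOneToOneExample, linked in the thread) is the usual Tez-native route.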