Great, thanks! Am I right in inferring that the HDFS in-memory tier helps speed up writes but not reads? Reads might still be served from disk, since the data is not cached in RAM.
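For context, this is roughly what I had in mind for tagging a path with the in-memory (LAZY_PERSIST) policy from the MemoryStorage doc linked below; the path name is just a placeholder and I'm assuming Hadoop 2.6+ with RAM disks already configured on the DataNodes, so treat it as a sketch rather than something I've verified:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class LazyPersistTagger {
      public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the loaded configuration points at the HDFS cluster.
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Placeholder path: new replicas written under it should go to the
        // DataNode RAM disk first and be lazily persisted to disk (LAZY_PERSIST).
        dfs.setStoragePolicy(new Path("/tmp/tez-intermediate"), "LAZY_PERSIST");
        dfs.close();
      }
    }
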
One of the alternatives I was exploring was running Tez atop Tachyon, but I have not been able to get that working so far :(

Raajay

> On Nov 30, 2015, at 6:34 PM, Rajesh Balamohan <[email protected]> wrote:
>
> Adding more to #2. Alternatively, you may want to consider adding paths to the
> HDFS in-memory tier
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html).
>
> ~Rajesh.B
>
> On Tue, Dec 1, 2015 at 5:41 AM, Rajesh Balamohan <[email protected]> wrote:
>
> 1. Is it possible to determine from the Tez history logs what the bottleneck
> for a task/vertex is? Whether it is compute, disk or network?
>
> - The vertex counters and task counters for the vertex can be examined to
> determine this. If you have enabled ATS, they are available in the Tez UI
> itself; otherwise they should be available in the job logs. However, the
> slowdown is not always directly attributable to compute/disk/network.
> Sometimes a vertex is delayed because it has to wait for data from its source
> vertex (think of it as a data dependency), sometimes because of re-execution
> of tasks in the source vertex after failures such as bad disks, sometimes
> because of cluster slot unavailability, and so on. You can also look at the
> CriticalPathAnalyzer (an early version is available in 0.8.x), which helps
> determine the critical path of the DAG and whether a vertex was slow for one
> of these reasons. E.g.:
>
> HADOOP_CLASSPATH=$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH yarn jar \
>     $TEZ_HOME/tez-job-analyzer-0.8.2-SNAPSHOT.jar CriticalPath \
>     --outputDir=/tmp/ --dagId=dag_1443665985063_58064_1
>
> 2. What are the common ways to get Tez to work on data in memory, as opposed
> to reading from HDFS? This is to minimize the time mappers spend reading from
> HDFS or disk.
>
> - Not sure if you are trying to compare this with the Spark way of loading
> data into memory and working on it. Tez does not have a direct equivalent,
> but it has an ObjectRegistry (look for BroadcastAndOneToOneExample
> <https://github.com/apache/tez/blob/b153035b076d4603eb6bc771d675d64181eb02e9/tez-tests/src/main/java/org/apache/tez/mapreduce/examples/BroadcastAndOneToOneExample.java>
> in the Tez codebase) where data can be stored in memory and shared between
> tasks.
>
> ~Rajesh.B
>
> On Tue, Dec 1, 2015 at 12:33 AM, Raajay <[email protected]> wrote:
>
> Hello,
>
> Two questions:
>
> 1. Is it possible to determine from the Tez history logs what the bottleneck
> for a task/vertex is? Whether it is compute, disk or network?
>
> 2. What are the common ways to get Tez to work on data in memory, as opposed
> to reading from HDFS? This is to minimize the time mappers spend reading from
> HDFS or disk.
>
> Thanks
> Raajay
>
> --
> ~Rajesh.B
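P.S. Based on the ObjectRegistry pointer above, this is the pattern I think is being suggested. It is only a rough sketch: the processor class, key name, and helper method are mine, and I haven't checked it against BroadcastAndOneToOneExample, so the exact API calls may need adjusting.

    import org.apache.tez.runtime.api.ObjectRegistry;
    import org.apache.tez.runtime.api.ProcessorContext;
    import org.apache.tez.runtime.library.processor.SimpleProcessor;

    // Sketch: cache an expensive-to-load object once per vertex so other tasks
    // of the same vertex running in the same container can reuse it instead of
    // re-reading it from HDFS.
    public class CachingProcessor extends SimpleProcessor {

      public CachingProcessor(ProcessorContext context) {
        super(context);
      }

      @Override
      public void run() throws Exception {
        ObjectRegistry registry = getContext().getObjectRegistry();
        Object lookup = registry.get("sharedLookup");        // key name is made up
        if (lookup == null) {
          lookup = loadLookupFromHdfs();                     // hypothetical helper
          registry.cacheForVertex("sharedLookup", lookup);   // scoped to this vertex
        }
        // ... use `lookup` while processing this task's inputs ...
      }

      private Object loadLookupFromHdfs() {
        // Placeholder for the expensive load the cache is meant to avoid repeating.
        return new Object();
      }
    }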
