In the HDFS in-memory tier, caching is best effort: data is written to RAM and asynchronously persisted to disk. This keeps the data reliably available despite memory pressure or a machine reboot, so the application remains functional.
The data continues to reside in memory until memory pressure forces it to be released, so the read path also benefits from the performance gains. There were some performance bottlenecks in the HDFS read path, and those were fixed as part of the in-memory tier changes.

Bikas

From: Raajay [mailto:[email protected]]
Sent: Monday, November 30, 2015 4:41 PM
To: [email protected]
Subject: Re: Running tez jobs with data in memory

Great, thanks! Am I right in inferring that the HDFS in-memory tier helps in speeding up writes and not reads? Reads might still happen from disk if there is no caching in RAM.

One of the alternatives I was exploring was running Tez atop Tachyon, but I have not been able to get that working so far :(

Raajay

On Nov 30, 2015, at 6:34 PM, Rajesh Balamohan <[email protected]> wrote:

Adding more to #2: alternatively, you may want to consider adding paths to the HDFS in-memory tier (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html) [see the LAZY_PERSIST sketch after the thread].

~Rajesh.B

On Tue, Dec 1, 2015 at 5:41 AM, Rajesh Balamohan <[email protected]> wrote:

1. Is it possible to determine from the tez history logs, what the bottleneck for a task/vertex is? Whether it is compute, disk or network?

- The vertex counters and task counters for the vertex can be examined to determine this. If you have enabled ATS, they are available in the Tez UI itself; otherwise they should be available in the job logs. However, slowness is not always directly attributable to compute/disk/network. Sometimes a vertex is delayed because it has to get data from a source vertex (think of it more as a data dependency), sometimes due to re-execution of tasks in the source vertex after failures such as bad disks, sometimes due to cluster slot unavailability, and so on. You can also look at the CriticalPathAnalyzer (an early version is available in 0.8.x), which can help determine the critical path of the DAG and whether a vertex was slow due to these different conditions. E.g.:

HADOOP_CLASSPATH=$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH yarn jar $TEZ_HOME/tez-job-analyzer-0.8.2-SNAPSHOT.jar CriticalPath --outputDir=/tmp/ --dagId=dag_1443665985063_58064_1

2. What are the common ways to get Tez to work on data in memory, as opposed to reading from HDFS? This is to minimize the duration mappers spend in reading from HDFS or disk.

- Not sure if you are trying to compare with the Spark way of loading data into memory and working on it. Tez does not have a direct equivalent for this, but Tez has an ObjectRegistry (look for BroadcastAndOneToOneExample <https://github.com/apache/tez/blob/b153035b076d4603eb6bc771d675d64181eb02e9/tez-tests/src/main/java/org/apache/tez/mapreduce/examples/BroadcastAndOneToOneExample.java> in the Tez codebase) where data can be stored in memory and shared between tasks [see the ObjectRegistry sketch after the thread].

~Rajesh.B

On Tue, Dec 1, 2015 at 12:33 AM, Raajay <[email protected]> wrote:

Hello,

Two questions:

1. Is it possible to determine from the tez history logs, what the bottleneck for a task/vertex is? Whether it is compute, disk or network?

2. What are the common ways to get Tez to work on data in memory, as opposed to reading from HDFS? This is to minimize the duration mappers spend in reading from HDFS or disk.

Thanks
Raajay

--
~Rajesh.B
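
Below is a minimal Java sketch of the LAZY_PERSIST suggestion: writing a file into the HDFS in-memory tier, assuming Hadoop 2.6+ and DataNodes configured with RAM_DISK storage (a [RAM_DISK] entry in dfs.datanode.data.dir). The path, buffer size, and payload are hypothetical. An existing directory can instead be tagged with the LAZY_PERSIST storage policy via the hdfs storagepolicies CLI.

    import java.util.EnumSet;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.CreateFlag;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class LazyPersistWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/lazy-persist-demo"); // hypothetical path

        // CreateFlag.LAZY_PERSIST asks HDFS to place a single replica in the
        // DataNode's RAM disk and persist it to disk lazily (best effort);
        // if memory is unavailable, the write falls back to disk.
        try (FSDataOutputStream out = fs.create(
            path,
            FsPermission.getFileDefault(),
            EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST),
            4096,                          // buffer size
            (short) 1,                     // lazy-persist files are single-replica
            fs.getDefaultBlockSize(path),
            null)) {                       // no progress callback
          out.write("hello from the in-memory tier".getBytes("UTF-8"));
        }
      }
    }

The single replica and the fallback behavior match the best-effort semantics Bikas describes above; the MemoryStorage page linked in the thread covers the DataNode configuration this sketch assumes.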

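And a minimal sketch of the ObjectRegistry approach: a Tez processor that builds an expensive object once and reuses it across tasks scheduled in the same container, assuming the 0.8.x runtime API. CachingProcessor, buildLookupTable, and the "lookup-table" key are hypothetical; note the registry lives in the container JVM, so sharing relies on container reuse.

    import java.util.HashMap;

    import org.apache.tez.runtime.api.ObjectRegistry;
    import org.apache.tez.runtime.api.ProcessorContext;
    import org.apache.tez.runtime.library.processor.SimpleProcessor;

    public class CachingProcessor extends SimpleProcessor {

      public CachingProcessor(ProcessorContext context) {
        super(context);
      }

      @Override
      public void run() throws Exception {
        ObjectRegistry registry = getContext().getObjectRegistry();

        // get() returns null on a miss; "lookup-table" is a hypothetical key.
        Object table = registry.get("lookup-table");
        if (table == null) {
          table = buildLookupTable();
          // cacheForDAG keeps the object for the lifetime of the DAG;
          // cacheForVertex and cacheForSession offer other lifetimes.
          registry.cacheForDAG("lookup-table", table);
        }
        // ... use `table` while processing this task's input ...
      }

      // Hypothetical expensive load we only want to pay for once per JVM.
      private Object buildLookupTable() {
        return new HashMap<String, String>();
      }
    }

For data that must reach every task regardless of container placement, a broadcast edge (as demonstrated in BroadcastAndOneToOneExample, linked in the thread) is the usual Tez-native route.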