Great, thanks! Am I right in inferring that the HDFS in-memory tier helps speed up writes but not reads? Reads might still be served from disk, since the data is not cached in RAM.
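For context, this is roughly what I had in mind for tagging a path with the in-memory (LAZY_PERSIST) policy from the MemoryStorage doc linked below; the path name is just a placeholder and I'm assuming Hadoop 2.6+ with RAM disks already configured on the DataNodes, so treat it as a sketch rather than something I've verified:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class LazyPersistTagger {
      public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the loaded configuration points at the HDFS cluster.
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Placeholder path: new replicas written under it should go to the
        // DataNode RAM disk first and be lazily persisted to disk (LAZY_PERSIST).
        dfs.setStoragePolicy(new Path("/tmp/tez-intermediate"), "LAZY_PERSIST");
        dfs.close();
      }
    }
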
One of the alternatives I was exploring was running Tez atop Tachyon, but I have not been able to get that working so far :(

Raajay

> On Nov 30, 2015, at 6:34 PM, Rajesh Balamohan <[email protected]> wrote:
>
> Adding more to #2. Alternatively, you may want to consider adding paths to the
> HDFS in-memory tier
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html).
>
> ~Rajesh.B
>
> On Tue, Dec 1, 2015 at 5:41 AM, Rajesh Balamohan <[email protected]> wrote:
>
> 1. Is it possible to determine from the Tez history logs what the bottleneck
> for a task/vertex is? Whether it is compute, disk or network?
>
> - The vertex counters and task counters for the vertex can be examined to
> determine this. If you have enabled ATS, they are available in the Tez UI
> itself; otherwise they should be available in the job logs. However, the
> slowdown is not always directly attributable to compute/disk/network.
> Sometimes a vertex is delayed because it has to wait for data from its source
> vertex (think of it as a data dependency), sometimes because of re-execution
> of tasks in the source vertex after failures such as bad disks, sometimes
> because of cluster slot unavailability, and so on. You can also look at the
> CriticalPathAnalyzer (an early version is available in 0.8.x), which helps
> determine the critical path of the DAG and whether a vertex was slow for one
> of these reasons. E.g.:
>
> HADOOP_CLASSPATH=$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH yarn jar \
>     $TEZ_HOME/tez-job-analyzer-0.8.2-SNAPSHOT.jar CriticalPath \
>     --outputDir=/tmp/ --dagId=dag_1443665985063_58064_1
>
> 2. What are the common ways to get Tez to work on data in memory, as opposed
> to reading from HDFS? This is to minimize the time mappers spend reading from
> HDFS or disk.
>
> - Not sure if you are trying to compare this with the Spark way of loading
> data into memory and working on it. Tez does not have a direct equivalent,
> but it has an ObjectRegistry (look for BroadcastAndOneToOneExample
> <https://github.com/apache/tez/blob/b153035b076d4603eb6bc771d675d64181eb02e9/tez-tests/src/main/java/org/apache/tez/mapreduce/examples/BroadcastAndOneToOneExample.java>
> in the Tez codebase) where data can be stored in memory and shared between
> tasks.
>
> ~Rajesh.B
>
> On Tue, Dec 1, 2015 at 12:33 AM, Raajay <[email protected]> wrote:
>
> Hello,
>
> Two questions:
>
> 1. Is it possible to determine from the Tez history logs what the bottleneck
> for a task/vertex is? Whether it is compute, disk or network?
>
> 2. What are the common ways to get Tez to work on data in memory, as opposed
> to reading from HDFS? This is to minimize the time mappers spend reading from
> HDFS or disk.
>
> Thanks
> Raajay
>
> --
> ~Rajesh.B
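P.S. Based on the ObjectRegistry pointer above, this is the pattern I think is being suggested. It is only a rough sketch: the processor class, key name, and helper method are mine, and I haven't checked it against BroadcastAndOneToOneExample, so the exact API calls may need adjusting.

    import org.apache.tez.runtime.api.ObjectRegistry;
    import org.apache.tez.runtime.api.ProcessorContext;
    import org.apache.tez.runtime.library.processor.SimpleProcessor;

    // Sketch: cache an expensive-to-load object once per vertex so other tasks
    // of the same vertex running in the same container can reuse it instead of
    // re-reading it from HDFS.
    public class CachingProcessor extends SimpleProcessor {

      public CachingProcessor(ProcessorContext context) {
        super(context);
      }

      @Override
      public void run() throws Exception {
        ObjectRegistry registry = getContext().getObjectRegistry();
        Object lookup = registry.get("sharedLookup");        // key name is made up
        if (lookup == null) {
          lookup = loadLookupFromHdfs();                     // hypothetical helper
          registry.cacheForVertex("sharedLookup", lookup);   // scoped to this vertex
        }
        // ... use `lookup` while processing this task's inputs ...
      }

      private Object loadLookupFromHdfs() {
        // Placeholder for the expensive load the cache is meant to avoid repeating.
        return new Object();
      }
    }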
