On 9/24/14, 1:07 AM, Fabio wrote:
Hi guys,
I am trying to understand how Tez actually works inside. I opened the
tez wordcount example and I see at some point there are classes
referring back to hadoop mapreduce classes. For this reason, and since I
see that a Tokenizer vertex has to finish before a Summation vertex can
start, I can't understand what the difference is from a normal mapreduce
job, where the maps have to finish before the reduce can start. There must
be something, since the performance is clearly better than Hadoop's
wordcount example (even if I'd need dedicated machines to be sure about
this, since I am running VMs on my quite old laptop).

The Tokenizer vertex needs to finish before the Summation vertex can finish.

The Summation vertex can start once all the Tokenizer vertices have been scheduled & a small fraction of them have completed.

Perhaps that is happening because you are running only 1 mapper, so the completed fraction stays at zero until it finishes. For a distributed job, this is roughly how it runs through (time L->R):

http://people.apache.org/~gopalv/1TB-etl.svg
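
For reference, that "small fraction" is the ShuffleVertexManager slow-start setting. A minimal sketch of how you'd tune it, assuming the 0.5.x property names (tez.shuffle-vertex-manager.min-src-fraction / max-src-fraction) and their usual 0.25/0.75 defaults - double-check against the released defaults:

import org.apache.tez.dag.api.TezConfiguration;

public class SlowStartConfig {
  public static TezConfiguration create() {
    TezConfiguration tezConf = new TezConfiguration();
    // Start scheduling Summation tasks once 25% of Tokenizer tasks have completed ...
    tezConf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.25f);
    // ... and have all of them scheduled by the time 75% have completed.
    tezConf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.75f);
    return tezConf;
  }
}

That is roughly the same knob as mapreduce.job.reduce.slowstart.completedmaps on the MR side.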

The wordcount example does indeed use some metaphors similar to the traditional map-reduce implementation, to help people who are migrating from an existing map-reduce setup.

Perhaps you can go over some of the Tez slides and get a clearer idea of how it builds on top of that basic foundation.

http://www.slideshare.net/t3rmin4t0r/tez-accelerating-data-pipelines/8
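
If it helps to see the shape of it, here is a rough sketch of how the wordcount DAG gets wired up with the 0.5.x Java API. This is from memory, not a copy of the shipped example - the processor class names are just stand-ins for the example's TokenProcessor/SumProcessor:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class WordCountDagSketch {
  // tokenizerClass / summationClass stand in for the example's processors.
  public static DAG build(String tokenizerClass, String summationClass, int numPartitions) {
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create(tokenizerClass));
    Vertex summation = Vertex.create("Summation",
        ProcessorDescriptor.create(summationClass), numPartitions);

    // A scatter-gather edge: a partitioned, sorted key-value stream between
    // the two vertices - i.e. the shuffle.
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    return DAG.create("WordCount")
        .addVertex(tokenizer)
        .addVertex(summation)
        .addEdge(Edge.create(tokenizer, summation,
            edgeConf.createDefaultEdgeProperty()));
  }
}

The point is that this is just one edge in an arbitrary DAG - a Hive or Pig query compiles down to many vertices and several edge types inside the same job, which is where the model stops looking like map-reduce.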

Also, I see tez has tez.runtime.io.sort.mb set at 32MB, so I set
mapreduce.task.io.sort.mb to 32 and made a comparison: I see the "old"
wordcount generates 21 spill files (something like
attempt_1411492856064_0001_m_000000_0_spill_XX.out) as big as 712KB
each, while with the tez wordcount I get 7 such files, but the size is
37MB each. Considering I am working on a 120MB input file (a little less
than one HDFS block), tez has to write way more than mapreduce on the
temporary dir. I thought these files were the intermediate map results,
but I can't see how they could be so much larger than the original input
and than the original wordcount's spill files.
I didn't enable any compression in the config file, working with Hadoop
2.5.0, Tez 0.5.0, master node + 2 slaves. Just one slave is used for the
wordcount example (I always get 1 map/tokenizer and 1 reduce/summation
running on the same node).

I think you have answered your own question there. MapReduce's compression flags don't enable Tez's intermediate compression.

If you're missing a tez-site.xml at the moment, I keep one updated with my "known good" configs.

https://github.com/t3rmin4t0r/tez-autobuild/blob/master/tez-site.xml#L76
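
If you only want the compression bits out of that file, the relevant knobs are something like this (property names from memory out of the Tez runtime config, so verify against the file above):

import org.apache.tez.dag.api.TezConfiguration;

public class TezCompressionConfig {
  public static TezConfiguration create() {
    TezConfiguration tezConf = new TezConfiguration();
    // Compress the intermediate (vertex-to-vertex) data ...
    tezConf.setBoolean("tez.runtime.compress", true);
    // ... using Snappy; any Hadoop CompressionCodec should work here.
    tezConf.set("tez.runtime.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
    return tezConf;
  }
}

Snappy needs the native Hadoop libraries on the nodes, so swap in a different codec if you don't have them.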

HTH, ask away if you have any more questions.

Cheers,
Gopal
