On Oct 31, 2009, at 11:22 AM, Rob Stewart wrote:
<snip>
Map and reduce parallelism are controlled differently in Hadoop. Map parallelism is controlled by the InputSplit, which determines how many maps to start and which file blocks to assign to which maps. In the case of PigMix, both the MR Java code and the Pig code use some subclass of FileInputFormat, so the map parallelism is the same in both tests. I do not know for sure, but I believe Hive also uses FileInputFormat.

Reduce parallelism is set explicitly as part of the job configuration. In MapReduce this is done through the Java API. In Pig it is done through the PARALLEL keyword. In PigMix, we set parallelism the same for both (40, I believe, for this data size).
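For concreteness, a minimal Pig Latin sketch of the PARALLEL keyword (the alias and field names here are made up; on the Java side the equivalent call is JobConf.setNumReduceTasks(40)):

    -- ask Hadoop for 40 reduce tasks for this operation, as in the PigMix setup
    grouped = GROUP page_views BY user PARALLEL 40;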
I have a query about this procedure. It probably warrants a simple answer, but I just need clarity on this. I am wondering how both the MR applications and the Pig programs will react if the number of map or reduce tasks is not specified. If, let's say, I were a programmer writing some Pig scripts where I do not know the skew of the data, my first execution of the Pig script would be done without any specification of the number of mappers or reducers. Is it not a more natural comparison of Pig vs. MR apps where both Pig and the MR app have to decide these details for themselves? So my question is: why is it a fundamental requirement that the Pig script and the associated MR app be given figures for the initial map/reduce tasks?
You as a Pig Latin script writer never control parallelism of the map. That is controlled by Hadoop's InputFormat class. The vast majority of MR programs written in Java use FileInputFormat, so most Java MR programmers don't directly control it either. FileInputFormat by default assigns one HDFS block to one map.
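For example (file path and schema made up), loading a file that occupies ten HDFS blocks will start roughly ten maps, with nothing in the script controlling that:

    -- ~10 HDFS blocks in the input => FileInputFormat creates ~10 splits,
    -- so Hadoop starts ~10 map tasks; the script never specifies this
    page_views = LOAD 'pigmix/page_views' AS (user, action, timespent);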
For reducers, if the script does not specify the level of parallelism for a given operation then Pig tells Hadoop to use the cluster default. Out of the box the cluster default is 1.
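So a script like the following (aliases made up), with no PARALLEL clause and no cluster override, runs its reduce phase with a single reducer:

    -- no PARALLEL clause: Pig submits the job with the cluster default
    -- number of reducers, which is 1 out of the box
    grouped = GROUP page_views BY user;
    STORE grouped INTO 'output';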
In general, the places where Pig beats MR are due to better algorithms. The MR code was written assuming a basic level of MR coding and database knowledge. So, for example, in the order by queries the MR code achieves a total order by having a single reducer at the end. Pig has a much more sophisticated system where it samples the data, determines a distribution, and then uses multiple reducers while maintaining a total order. So for large data sets Pig will beat MR for these particular tests.
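All of that machinery hides behind a single statement; a sketch with made-up names:

    -- Pig first runs a small sampling job over page_views, builds a balanced
    -- range partition from the sample, and then sorts with 40 reducers while
    -- still producing a single total order across all output parts
    sorted = ORDER page_views BY timespent DESC PARALLEL 40;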
Sounds very elegant, a really neat solution to skewed data. Is there some documentation of this process? I'd like to include that methodology in my report, and then display data results like "skewed data / execution time", where trend lines for Pig, Hive, and the MR apps are shown. It would be nice to show that, as the skew of the data increases, Pig overtakes the corresponding MR app in execution performance.
Functional specs for Pig are linked off of http://wiki.apache.org/pig/. There is a spec there for how we handle skew in joins. I don't see one on handling skew in order by.
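The join-skew handling that spec describes shows up in Pig Latin as the 'skewed' join type (available in recent Pig releases; aliases and field names made up):

    -- Pig samples the keys of the left input and spreads heavily skewed keys
    -- across several reducers, replicating the matching rows of the right input
    joined = JOIN page_views BY user, users BY name USING 'skewed' PARALLEL 40;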
Alan.