On Oct 31, 2009, at 11:22 AM, Rob Stewart wrote:
<snip>
Map and reduce parallelism are controlled differently in Hadoop. Map parallelism is controlled by the InputSplit, which determines how many maps to start and which file blocks to assign to which maps. In the case of PigMix, both the MR Java code and the Pig code use some subclass of FileInputFormat, so the map parallelism is the same in both tests. I do not know for sure, but I believe Hive also uses FileInputFormat.

Reduce parallelism is set explicitly as part of the job configuration. In MapReduce this is done through the Java API. In Pig it is done through the PARALLEL keyword. In PigMix, we set parallelism the same for both (40, I believe, for this data size).
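For concreteness, a minimal Pig Latin sketch of the PARALLEL keyword (the alias and field names here are made up; on the Java side the equivalent call is JobConf.setNumReduceTasks(40)):

    -- ask Hadoop for 40 reduce tasks for this operation, as in the PigMix setup
    grouped = GROUP page_views BY user PARALLEL 40;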
I have a query about this procedure. It probably warrants a simple answer, but I just need clarity on this. I am wondering how both the MR applications and the Pig programs will react if the number of map or reduce tasks is not specified. If, let's say, I were a programmer writing some Pig scripts where I do not know the skew of the data, my first execution of the Pig script would be done without any specification of the number of mappers or reducers. Is it not a more natural comparison of Pig vs. MR apps where both Pig and the MR app have to decide these details for themselves? So my question is: why is it a fundamental requirement that the Pig script and the associated MR app be given figures for the initial map/reduce tasks?
You as a Pig Latin script writer never control parallelism of the map. That is controlled by Hadoop's InputFormat class. The vast majority of MR programs written in Java use FileInputFormat, so most Java MR programmers don't directly control it either. FileInputFormat by default assigns one HDFS block to one map.
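For example (file path and schema made up), loading a file that occupies ten HDFS blocks will start roughly ten maps, with nothing in the script controlling that:

    -- ~10 HDFS blocks in the input => FileInputFormat creates ~10 splits,
    -- so Hadoop starts ~10 map tasks; the script never specifies this
    page_views = LOAD 'pigmix/page_views' AS (user, action, timespent);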
For reducers, if the script does not specify the level of parallelism for a given operation then Pig tells Hadoop to use the cluster default. Out of the box the cluster default is 1.
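So a script like the following (aliases made up), with no PARALLEL clause and no cluster override, runs its reduce phase with a single reducer:

    -- no PARALLEL clause: Pig submits the job with the cluster default
    -- number of reducers, which is 1 out of the box
    grouped = GROUP page_views BY user;
    STORE grouped INTO 'output';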
In general, the places where Pig beats MR are due to better algorithms. The MR code was written assuming a basic level of MR coding and database knowledge. So, for example, in the order by queries the MR code achieves a total order by having a single reducer at the end. Pig has a much more sophisticated system where it samples the data, determines a distribution, and then uses multiple reducers while maintaining a total order. So for large data sets Pig will beat MR for these particular tests.
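All of that machinery hides behind a single statement; a sketch with made-up names:

    -- Pig first runs a small sampling job over page_views, builds a balanced
    -- range partition from the sample, and then sorts with 40 reducers while
    -- still producing a single total order across all output parts
    sorted = ORDER page_views BY timespent DESC PARALLEL 40;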
Sounds very elegant, a really neat solution to skewed data. Is there some documentation of this process? I'd like to include that methodology in my report, and then display data results like "skewed data / execution time", where trend lines for Pig, Hive, and the MR apps are shown. It would be nice to show that, as the skew of the data increases, Pig overtakes the corresponding MR app in execution performance.
Functional specs for Pig are linked off of http://wiki.apache.org/pig/. There is a spec there for how we handle skew in joins. I don't see one on handling skew in order by.
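The join-skew handling that spec describes shows up in Pig Latin as the 'skewed' join type (available in recent Pig releases; aliases and field names made up):

    -- Pig samples the keys of the left input and spreads heavily skewed keys
    -- across several reducers, replicating the matching rows of the right input
    joined = JOIN page_views BY user, users BY name USING 'skewed' PARALLEL 40;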
Alan.