On Oct 31, 2009, at 11:22 AM, Rob Stewart wrote:

<snip>
Map and reduce parallelism are controlled differently in Hadoop. Map
parallelism is controlled by the InputFormat, whose splits determine how
many maps to start and which file blocks to assign to which maps. In the
case of PigMix, both the MR Java code and the Pig code use some subclass
of FileInputFormat, so the map parallelism is the same in both tests. I
do not know for sure, but I believe Hive also uses FileInputFormat.
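
Concretely, the Java side of that looks roughly like the sketch below.
The class name and input path are made up for illustration; the point is
only that a FileInputFormat subclass is what decides the split count.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class MapParallelismSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf(MapParallelismSketch.class);
            // TextInputFormat is a FileInputFormat subclass; the number
            // of map tasks falls out of the splits it computes (by
            // default one HDFS block per map), not out of anything the
            // programmer sets explicitly here.
            conf.setInputFormat(TextInputFormat.class);
            FileInputFormat.setInputPaths(conf,
                new Path("/data/pigmix/page_views")); // hypothetical path
            // ... mapper/reducer classes and JobClient.runJob(conf)
            // would follow in a real job ...
        }
    }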

Reduce parallelism is set explicitly as part of the job configuration. In
MapReduce this is done through the Java API. In Pig it is done through
the PARALLEL keyword. In PigMix, we set the parallelism to the same value
for both (40, I believe, for this data size).
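
Concretely, the two mechanisms look like this. The 40 is the PigMix
figure above; the class, relation, and field names are made up.

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceParallelismSketch {
        public static JobConf configure() {
            JobConf conf = new JobConf();
            // Java MR side: reduce parallelism is set on the job
            // configuration.
            conf.setNumReduceTasks(40);
            // Pig Latin side: the PARALLEL keyword on an operator has
            // the same effect, e.g. (shown as a comment to keep this
            // snippet in one language):
            //   B = ORDER A BY user PARALLEL 40;
            return conf;
        }
    }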


I have a query about this procedure. It will warrant a simple answer, I
assume, but I just need clarity on this. I am wondering how both the MR
applications and the Pig programs behave if the number of map or reduce
tasks is not specified. If, let's say, I were a programmer writing some
Pig scripts where I do not know the skew of the data, my first execution
of the Pig script would be done without any specification of #Mappers or
#Reducers. Would it not be a more natural comparison of Pig vs MR apps if
both Pig and the MR app had to decide these details for themselves? So my
question is: why is it a fundamental requirement that the Pig script and
the associated MR app be given figures for the initial map/reduce tasks?

You as a Pig Latin script writer never control the parallelism of the map
phase. That is controlled by Hadoop's InputFormat class. The vast majority
of MR programs written in Java use FileInputFormat, so most Java MR
programmers don't directly control it either. FileInputFormat by default
assigns one HDFS block to one map.
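
For the curious, FileInputFormat's default split sizing in the old mapred
API boils down to roughly this (a paraphrase of the Hadoop source, not a
verbatim copy):

    public class SplitSizeSketch {
        // goalSize is total input bytes divided by the suggested number
        // of maps; minSize comes from mapred.min.split.size (default 1).
        static long computeSplitSize(long goalSize, long minSize,
                                     long blockSize) {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }
        // With the defaults this returns blockSize, so a file of ten
        // HDFS blocks becomes ten splits and therefore ten map tasks.
    }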

For reducers, if the script does not specify the level of parallelism for a given operation, then Pig tells Hadoop to use the cluster default. Out of the box the cluster default is 1.
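
In other words, with nothing set anywhere, the chain bottoms out at the
cluster's mapred.reduce.tasks property. A minimal sketch:

    import org.apache.hadoop.mapred.JobConf;

    public class DefaultReducersSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Nothing has set the reduce count, so this reads the
            // cluster's mapred.reduce.tasks property, which is 1 on an
            // out-of-the-box installation.
            System.out.println(conf.getNumReduceTasks());
        }
    }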




In general, the places where Pig beats MR are due to better algorithms.
The MR code was written assuming a basic level of MR coding and database
knowledge. For example, in the order by queries, the MR code achieves a
total order by having a single reducer at the end. Pig has a much more
sophisticated system where it samples the data, determines a
distribution, and then uses multiple reducers while maintaining a total
order. So for large data sets Pig will beat MR on these particular tests.
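
Pig's implementation of this is internal, but the same idea can be
sketched with Hadoop's stock InputSampler and TotalOrderPartitioner.
This is an analogy under the old mapred API, not Pig's actual code; the
paths, types, and sampling figures are made up.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.lib.InputSampler;
    import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

    public class TotalOrderSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(TotalOrderSketch.class);
            conf.setNumReduceTasks(40);
            conf.setInputFormat(KeyValueTextInputFormat.class); // Text keys
            conf.setMapOutputKeyClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path("/data/to/sort"));
            // 1. Sample ~1% of the input (at most 10000 keys from at
            //    most 10 splits) to estimate the key distribution.
            InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
            // 2. Derive 39 range boundaries (one fewer than the reducer
            //    count) from the sample and write them to a file.
            TotalOrderPartitioner.setPartitionFile(conf,
                new Path("/tmp/order-by-bounds"));
            InputSampler.writePartitionFile(conf, sampler);
            // 3. Route each key to the reducer owning its range. All
            //    keys in reducer i sort before all keys in reducer i+1,
            //    so the 40 sorted outputs concatenate into a total order.
            conf.setPartitionerClass(TotalOrderPartitioner.class);
            // ... mapper, output setup, and JobClient.runJob(conf) omitted
        }
    }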


Sounds very elegant, a really neat solution to skewed data. Is there some
documentation of this process? I'd like to include that methodology in my
report, and then display results like "skew of data vs. execution time",
with trend lines for Pig, Hive, and the MR apps. It would be nice to show
that, as the skew of the data increases, Pig overtakes the associated MR
app in execution performance.

Functional specs for Pig are linked off of http://wiki.apache.org/pig/. There is a spec there for how we handle skew in joins. I don't see one on handling skew in order by.

Alan.
