Hi,

I have a question about map parallelism in Pig.

I am using Pig to stream a file through a Python script that performs some
computationally expensive transforms. The whole job ends up in a single
map task, which can take a very long time if it happens to run on one of
the weaker nodes in the cluster. I am wondering how I can force the work
to be split across multiple map tasks on different nodes.
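
For reference, the relevant part of my Pig script looks roughly like this
(the alias names and paths are simplified stand-ins for the real ones):

    DEFINE xform `python transform.py` SHIP('transform.py');
    raw         = LOAD '/data/input.txt';
    transformed = STREAM raw THROUGH xform;
    STORE transformed INTO '/data/output';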

From reading
http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause, I
see that the parallelism of maps is "determined by the input file, one map
for each HDFS block."

The file I am operating on is 40 MB; the block size is 64 MB, so presumably
the file is stored in a single HDFS block. The replication factor for the
file is 3, and the DFS web UI verifies this.
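
(The block layout can also be confirmed from the command line; the path
here is a stand-in for the real one:

    hadoop fsck /data/input.txt -files -blocks -locations

which should show a single block with three replicas.)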

My question is: is there anything I can do to increase the map-side
parallelism here? Am I right that the replication factor of 3 has no
bearing on how many map tasks can run simultaneously? And should I rewrite
the file with a smaller HDFS block size?
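
If a smaller block size is the answer, I assume I can set it per file when
rewriting the data, rather than lower the cluster-wide default, with
something like:

    hadoop fs -D dfs.block.size=8388608 -put input.txt /data/input.txt

At 8 MB per block that should split the 40 MB file into 5 blocks and, if I
am reading the documentation above correctly, give me 5 map tasks.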

I am using Hadoop 0.20.2 and Pig 0.7.0.

Thanks,
- Charles
