Hi, I have a question about map parallelism in Pig.
I am using Pig to stream a file through a Python script that performs some computationally expensive transforms (a rough sketch of what I'm doing is at the bottom of this mail). This work is assigned to a single map task, which can take a very long time if it happens to run on one of the weaker nodes in the cluster. I would like to force the work to be spread across a number of nodes.

From reading http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause, I see that map parallelism is "determined by the input file, one map for each HDFS block." The file I am operating on is 40 MB and the block size is 64 MB, so presumably the file is stored in a single HDFS block. The replication factor for the file is 3, which the DFS web UI confirms.

My questions are:

- Is there anything I can do to increase the parallelism of the map phase?
- Am I correct that the replication factor of 3 has no influence on how many map tasks can run simultaneously?
- Should I use a smaller HDFS block size?

I am using Hadoop 0.20.2 and Pig 0.7.0.

Thanks,
- Charles
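
P.S. For reference, here is a minimal sketch of the relevant part of my script. The file paths, script name, and aliases are placeholders, not my actual names:

    -- Ship the Python script to the task nodes and stream each record through it
    DEFINE transform `python transform.py` SHIP('transform.py');

    -- Load the 40 MB input file, one line per record
    raw = LOAD 'input/data.txt' USING TextLoader() AS (line:chararray);

    -- This is the expensive step that ends up in a single map task
    transformed = STREAM raw THROUGH transform;

    STORE transformed INTO 'output/transformed';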