Excellent, that did the trick. For reference, I did:
export PIG_OPTS="$PIG_OPTS -Dmapred.max.split.size=1000000"

Thanks for your help.

- Charles

On Tue, Dec 14, 2010 at 11:59 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> Try
>
> set mapred.max.split.size $desired_split_size
>
> -D
>
> On Tue, Dec 14, 2010 at 8:10 PM, Charles W <cw2...@gmail.com> wrote:
> > Hi,
> >
> > I have a question about map parallelism in Pig.
> >
> > I am using Pig to stream a file through a Python script that performs some
> > computationally expensive transforms. This process is assigned to a single
> > map task that can take a very long time if it happens to execute on one of
> > the weaker nodes in the cluster. I am wondering how I can force the map
> > task to be spread across a number of nodes.
> >
> > From reading
> > http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause,
> > I see that the parallelism of maps is "determined by the input file, one
> > map for each HDFS block."
> >
> > The file I am operating on is 40 MB; the block size is 64 MB, so
> > presumably the file is stored in a single HDFS block. The replication
> > factor for the file is 3, and the DFS web UI verifies this.
> >
> > My question is: Is there anything I can do to increase the parallelism of
> > the map task? Is it the case that the replication factor being 3 does not
> > influence how many map tasks can be performed simultaneously? Should I
> > use a smaller HDFS block size?
> >
> > I am using Hadoop 0.20.2, Pig 0.7.0.
> >
> > Thanks,
> > - Charles
> >
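
For anyone landing on this thread later, a minimal sketch of how the same split-size override can be set inside the Pig script itself (the approach Dmitriy's `set` suggestion points at) might look like the following. The script name transform.py, the input/output paths, and the 1 MB split size are illustrative placeholders, not details from the cluster discussed above:

-- Minimal sketch of a Pig streaming script with a reduced max split size.
-- Cap the input split size at ~1 MB so a 40 MB file yields ~40 splits,
-- and therefore ~40 map tasks, instead of one map per 64 MB block.
set mapred.max.split.size 1000000;

-- Load each input line as a single chararray field.
raw = LOAD 'input/data.txt' USING TextLoader() AS (line:chararray);

-- Stream every record through the (hypothetical) Python transform.
transformed = STREAM raw THROUGH `python transform.py`;

STORE transformed INTO 'output/transformed';

The alternative shown at the top of the thread, exporting PIG_OPTS with -Dmapred.max.split.size before launching Pig, achieves the same effect without editing the script.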