Excellent, that did the trick.

For reference, I did:

export PIG_OPTS="$PIG_OPTS -Dmapred.max.split.size=1000000"
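
For completeness, a minimal sketch of the full setup; the input path and
script name below are placeholders, not the actual ones from this job.
With mapred.max.split.size capped at 1,000,000 bytes, the 40 MB input
should be divided into roughly 40 splits, so the streaming step runs as
roughly 40 map tasks instead of one:

  -- ship the Python script to the task nodes and stream records through it
  DEFINE transform `python transform.py` SHIP('transform.py');
  raw = LOAD 'input.txt';
  out = STREAM raw THROUGH transform;
  STORE out INTO 'output';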

Thanks for your help.

- Charles

On Tue, Dec 14, 2010 at 11:59 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Try
>
> set mapred.max.split.size $desired_split_size
>
> -D
>
> On Tue, Dec 14, 2010 at 8:10 PM, Charles W <cw2...@gmail.com> wrote:
> > Hi,
> >
> > I have a question about map parallelism in Pig.
> >
> > I am using Pig to stream a file through a Python script that performs
> > some computationally expensive transforms. This process is assigned
> > to a single map task that can take a very long time if it happens to
> > execute on one of the weaker nodes in the cluster. I am wondering how
> > I can force the map task to be spread across a number of nodes.
> >
> > From reading
> > http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause,
> > I see that the parallelism of maps is "determined by the input file,
> > one map for each HDFS block."
> >
> > The file I am operating on is 40 MB; the block size is 64 MB, so
> > presumably the file is stored in a single HDFS block. The replication
> > factor for the file is 3, and the DFS web UI verifies this.
> >
> > My question is: Is there anything I can do to increase the parallelism
> > of the map task? Is it the case that the replication factor being 3
> > does not influence how many map tasks can be performed simultaneously?
> > Should I use a smaller HDFS block size?
> >
> > I am using Hadoop 0.20.2, Pig 0.7.0.
> >
> > Thanks,
> > - Charles
> >
>