The script you posted wouldn't have any reducers, so the setting wouldn't matter.
It's a map-only job: LOAD plus FOREACH ... GENERATE plus STORE compiles to
mappers only, and default_parallel/PARALLEL only control the reduce phase.
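Since both FOREACHes run in the map phase, the knob for parallelism there is
the number of input splits rather than PARALLEL. A minimal sketch, assuming
MySequenceFileLoader sits on a standard FileInputFormat that honors the
split-size properties (property names differ slightly across Hadoop versions):
------------------------------------------------
-- Keep Pig from recombining small splits into larger ones:
SET pig.splitCombination false;
-- Cap splits at 64MB; ~400GB of input then gives on the order of
-- 6400 map tasks (subject to block layout and the InputFormat):
SET mapred.max.split.size 67108864;
------------------------------------------------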


2013/3/15 <tsuna...@o2.pl>

> Dear Apache Pig Users,
>
> It is easy to control the number of reducers in JOIN, GROUP, COGROUP,
> etc. statements via the global "set default_parallel $NUM" command or a
> "PARALLEL $NUM" clause at the end of the statement.
>
> However, I am interested in controlling the number of reducers used by
> a FOREACH statement.
> The case is as follows:
> * on CDH 4.0.1 with Pig 0.9.2,
> * read one sequence file (of many equivalent files) of about 400GB,
> * process each element in a UDF __using as many reducers as possible__,
> * store the results.
>
> The Apache Pig script implementing this case -- which gives __only
> one__ reducer -- is below:
> ------------------------------------------------
> SET default_parallel 16;
> REGISTER myjar.jar;
> input_pairs = LOAD '$input'
>     USING pl.example.MySequenceFileLoader(
>         'org.apache.hadoop.io.BytesWritable',
>         'org.apache.hadoop.io.BytesWritable')
>     AS (key:chararray, value:bytearray);
> input_protos  = FOREACH input_pairs GENERATE FLATTEN(pl.example.ReadProtobuf(value));
> output_protos = FOREACH input_protos GENERATE FLATTEN(pl.example.XMLGenerator(*));
> STORE output_protos INTO '$output' USING PigStorage();
> ------------------------------------------------
>
> As far as I know, "set mapred.reduce.tasks 5" can only cap the maximum
> number of reducers.
>
> Could you give me some advice? Am I missing something?
>
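If you really need the UDF to run reduce-side so that PARALLEL takes effect,
you have to introduce a shuffle yourself. A rough sketch (the random grouping
key is just an assumption to spread records evenly, and it costs a full
shuffle of the ~400GB):
------------------------------------------------
-- Tag each record with a random key in the map phase, then group so
-- the heavy XMLGenerator call runs in the 16 reducers:
grouped       = GROUP input_protos BY RANDOM() PARALLEL 16;
spread        = FOREACH grouped GENERATE FLATTEN(input_protos);
output_protos = FOREACH spread GENERATE FLATTEN(pl.example.XMLGenerator(*));
------------------------------------------------
The split-size route above is usually cheaper, since it avoids moving the
data at all.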
