le mappers write a single file. In fact, even if multiple
writers were possible, you'd get in trouble if mappers failed...
-----Original Message-----
From: "Mridul Muralidharan"
To: "user@pig.apache.org"
Cc: "Dmitriy Ryaboy"; "Xiaomeng Wan"
Sent:
Using rand() as group key, in general, is a pretty bad idea in case of
failures.
- Mridul
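(Not from the thread, just a sketch of why this bites and one workaround: if a map task fails and is re-run, RANDOM() produces different keys on the retry, so rows can land in different groups than in the failed attempt. Deriving the bucket from the data itself keeps the key stable across retries. The relation name `b` and the assumption that `$0` is an integer field are illustrative.)

```pig
-- deterministic alternative to grouping by RANDOM():
-- assumes $0 is an integer field; the modulus fixes the number of buckets
keyed   = foreach b generate *, ($0 % 30) as bucket;
grouped = group keyed by bucket parallel 30;
d       = foreach grouped generate FLATTEN(keyed);
```

Because the bucket is a pure function of the row, a retried mapper emits exactly the same keys as the failed attempt.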
On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote:
Don't order, that's expensive.
Just group by rand(), specify parallelism on the group by, and store the
result of "foreach grouped generate FLATTEN(name_of_original_relation);"
Thanks, all of you.
I have tested that, and it works well.
Below is the Pig code:
a = load '/logs/2011-03-31';
b = filter a by $1=='a' and $2=='b';
c = group b by RANDOM() parallel 30; /* you can modify the parallel
number here; it controls the number of output files */
d = foreach c generate FLATTEN(b);
If I have many TB of input, and I have configured the block size as 128M,
the job will generate thousands of mappers, and thousands of output
files.
Because too many files increase the load on the NameNode, and also
increase the I/O load in the cluster, I need to
Don't order, that's expensive.
Just group by rand(), specify parallelism on the group by, and store the
result of "foreach grouped generate FLATTEN(name_of_original_relation);"
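Spelled out end to end, the recipe Dmitriy describes looks roughly like this (it mirrors the snippet Jameson posted; the paths, relation names, and the filter are illustrative, not prescribed by the thread):

```pig
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- RANDOM() scatters rows evenly across the reducers;
-- parallel 30 means the group-by runs with 30 reducers, so ~30 part files
c = group b by RANDOM() parallel 30;
-- FLATTEN drops the random group key and restores the original rows
d = foreach c generate FLATTEN(b);
store d into '/logs/2011-03-31-merged';
```

The group-by exists only to force a reduce phase with a chosen reducer count; no sorting happens, which is why this is cheaper than an order-by.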
On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan wrote:
Hi Jameson,
Do you mind adding something like this:
c = order b by $0 parallel n;
store c into '20110331-ab';
You can order on anything; it will add a reduce step and give you fewer files.
Regards,
Shawn
On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li wrote:
> Hi,
>
> When I run the below pig codes:
> a
I can't think of a simple way to accomplish that without reducing the parallelism of your M/R jobs,
which of course would affect the performance of your script.
Things I'd take into account:
* how much data are you reading / writing with this Pig script?
* do you really need thousands of output files?