RE: store less files

2011-04-02 Thread Dmitriy Ryaboy
You can't have multiple mappers write a single file. In fact, even if multiple writers were possible, you'd get in trouble if mappers failed.

Re: store less files

2011-04-02 Thread Mridul Muralidharan
Using rand() as a group key is, in general, a pretty bad idea in case of failures. - Mridul On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote: Don't order, that's expensive. Just group by rand(), specify parallelism on the group by, and store the result of "foreach grouped generate FLATTEN(name_of_original_relation);"
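Mridul's concern is that a failed-and-rerun mapper re-draws its random keys, so records can land in different groups across attempts. One way around this, not spelled out in the thread, is to bucket on a value derived from the data itself; a minimal sketch, assuming $0 can be cast to int and that 30 output files are wanted (the field, bucket count, and output path are illustrative):

```pig
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- bucket on a value computed from the record, not RANDOM(), so a
-- re-run mapper produces the same key for the same record
c = group b by ((int)$0 % 30) parallel 30;
d = foreach c generate FLATTEN(b);
store d into '20110331-ab-deterministic';
```

Any deterministic expression over the input fields works as the key; the modulo just spreads records evenly across the 30 reducers.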

Re: store less files

2011-04-01 Thread Jameson Li
Thanks, all of you. I have tested that and it works well. Below is the Pig code:
a = load '/logs/2011-03-31';
b = filter a by $1=='a' and $2=='b';
c = group b by RANDOM() parallel 30; /* here you can modify the parallel number, and it controls the number of output files */
d = foreach c generate FLATTEN(b);

Re: store less files

2011-04-01 Thread Jameson Li
If I have many TB of input and have configured a 128M block size, a job will generate thousands of mappers and therefore thousands of output files (for example, 1 TB of input at a 128 MB block size is about 8,192 blocks, hence about 8,192 map tasks and part files from a map-only job). Too many files increase the load on the NameNode and also increase the IO load in the cluster, so I need to reduce the number of output files.

Re: store less files

2011-04-01 Thread Dmitriy Ryaboy
Don't order, that's expensive. Just group by rand(), specify parallelism on the group by, and store the result of "foreach grouped generate FLATTEN(name_of_original_relation);" On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan wrote: > Hi Jameson, > > Would you mind adding something like this: > > c = order b by $0 parallel n;
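Put together, Dmitriy's suggestion looks roughly like this, reusing the load/filter from elsewhere in the thread (the parallelism value and output path are illustrative):

```pig
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- the GROUP forces a reduce phase; PARALLEL sets the reducer count,
-- which caps the number of part files written
c = group b by RANDOM() parallel 30;
-- flatten the grouped bags back into the original rows
d = foreach c generate FLATTEN(b);
store d into '20110331-ab';
```

The group key is thrown away by the FLATTEN, so the grouping exists purely to funnel the map output through 30 reducers and thus 30 files.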

Re: store less files

2011-04-01 Thread Xiaomeng Wan
Hi Jameson, Would you mind adding something like this: c = order b by $0 parallel n; store c into '20110331-ab'; You can order on anything; it will add a reduce step and give you fewer files. Regards, Shawn On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li wrote: > Hi, > > When I run the below Pig code: > a = load '/logs/2011-03-31';
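Xiaomeng's order-based variant, sketched with the same load/filter as the rest of the thread (the parallelism value is illustrative; as the later replies note, ORDER is more expensive than GROUP because it performs a full sort):

```pig
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- ORDER adds a reduce phase; n reducers yield at most n output files
c = order b by $0 parallel 10;
store c into '20110331-ab';
```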

Re: store less files

2011-04-01 Thread Jameson Lopp
I can't think of a simple way to accomplish that without reducing the parallelism of your M/R jobs, which of course would affect the performance of your script. Things I'd take into account:
* how much data are you reading / writing with this pig script?
* do you really need thousands of output files?