RE: store less files

2011-04-02 Thread Dmitriy Ryaboy
You can't have multiple mappers write a single file. In fact, even if multiple writers were possible, you'd get in trouble if mappers failed.

Re: store less files

2011-04-02 Thread Mridul Muralidharan
Using rand() as a group key is, in general, a pretty bad idea in case of failures. - Mridul On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote: Don't order, that's expensive. Just group by rand(), specify parallelism on the group by, and store the result of "foreach grouped generate FLATTEN(name_of_original_relation);"
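Mridul's concern is that a failed-and-rerun mapper re-draws its random keys, so records can land in different groups across attempts. One way around this, not spelled out in the thread, is to bucket on a value derived from the data itself; a minimal sketch, assuming $0 can be cast to int and that 30 output files are wanted (the field, bucket count, and output path are illustrative):

```pig
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- bucket on a value computed from the record, not RANDOM(), so a
-- re-run mapper produces the same key for the same record
c = group b by ((int)$0 % 30) parallel 30;
d = foreach c generate FLATTEN(b);
store d into '20110331-ab-deterministic';
```

Any deterministic expression over the input fields works as the key; the modulo just spreads records evenly across the 30 reducers.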

Re: store less files

2011-04-01 Thread Jameson Li
Thanks, all of you. I have tested that and it works well. Below is the Pig code:
a = load '/logs/2011-03-31';
b = filter a by $1=='a' and $2=='b';
c = group b by RANDOM() parallel 30; /* here you can modify the parallel number, and it controls the number of output files */
d = foreach c generate FLATTEN(b);

Re: store less files

2011-04-01 Thread Jameson Li
If I have many TB of input and have configured a 128M block size, a job will generate thousands of mappers and therefore thousands of output files (for example, 1 TB of input at a 128 MB block size is about 8,192 blocks, hence about 8,192 map tasks and part files from a map-only job). Too many files increase the load on the NameNode and also increase the IO load in the cluster, so I need to reduce the number of output files.

Re: store less files

2011-04-01 Thread Dmitriy Ryaboy
Don't order, that's expensive. Just group by rand(), specify parallelism on the group by, and store the result of "foreach grouped generate FLATTEN(name_of_original_relation);" On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan wrote: > Hi Jameson, > > Would you mind adding something like this: > > c = order b by $0 parallel n;
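Put together, Dmitriy's suggestion looks roughly like this, reusing the load/filter from elsewhere in the thread (the parallelism value and output path are illustrative):

```pig
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- the GROUP forces a reduce phase; PARALLEL sets the reducer count,
-- which caps the number of part files written
c = group b by RANDOM() parallel 30;
-- flatten the grouped bags back into the original rows
d = foreach c generate FLATTEN(b);
store d into '20110331-ab';
```

The group key is thrown away by the FLATTEN, so the grouping exists purely to funnel the map output through 30 reducers and thus 30 files.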

Re: store less files

2011-04-01 Thread Xiaomeng Wan
Hi Jameson, Would you mind adding something like this: c = order b by $0 parallel n; store c into '20110331-ab'; You can order on anything; it will add a reduce step and give you fewer files. Regards, Shawn On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li wrote: > Hi, > > When I run the below Pig code: > a = load '/logs/2011-03-31';
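Xiaomeng's order-based variant, sketched with the same load/filter as the rest of the thread (the parallelism value is illustrative; as the later replies note, ORDER is more expensive than GROUP because it performs a full sort):

```pig
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- ORDER adds a reduce phase; n reducers yield at most n output files
c = order b by $0 parallel 10;
store c into '20110331-ab';
```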

Re: store less files

2011-04-01 Thread Jameson Lopp
I can't think of a simple way to accomplish that without reducing the parallelism of your M/R jobs, which of course would affect the performance of your script. Things I'd take into account:
* how much data are you reading / writing with this pig script?
* do you really need thousands of output files?