That's a good call, thanks Mridul. Something reproducible like taking a hash of a tuple field is much better.
As for the concern about having to move all the data -- until HDFS allows multiple writers to a single file (not on the roadmap, afaik), there isn't a good way to have multiple mappers write a single file. In fact, even if multiple writers were possible, you'd get in trouble if mappers failed...

-----Original Message-----
From: "Mridul Muralidharan" <mrid...@yahoo-inc.com>
To: "user@pig.apache.org" <user@pig.apache.org>
Cc: "Dmitriy Ryaboy" <dvrya...@gmail.com>; "Xiaomeng Wan" <shawn...@gmail.com>
Sent: 4/2/2011 6:57 AM
Subject: Re: store less files

Using rand() as the group key is, in general, a pretty bad idea in case of failures: a retried map task will re-draw its random numbers, so re-executed records can land in different groups than they did on the first attempt.

- Mridul

On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote:
> Don't order, that's expensive.
> Just group by rand(), specify parallelism on the group by, and store the
> result of "foreach grouped generate FLATTEN(name_of_original_relation);"
>
> On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan <shawn...@gmail.com> wrote:
>
>> Hi Jameson,
>>
>> Would you mind adding something like this:
>>
>> c = order b by $0 parallel n;
>> store c into '20110331-ab';
>>
>> You can order on anything. It will add a reduce phase and give you fewer files.
>>
>> Regards,
>> Shawn
>>
>> On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li <hovlj...@gmail.com> wrote:
>>> Hi,
>>>
>>> When I run the Pig code below:
>>>
>>> a = load '/logs/2011-03-31';
>>> b = filter a by $1=='a' and $2=='b';
>>> store b into '20110331-ab';
>>>
>>> It runs an M/R job that has thousands of maps, and then creates an output store
>>> directory [truncated by sender]
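For the archives, putting the two suggestions together (the group-with-parallelism trick, but keyed on a deterministic hash instead of rand()) might look roughly like the sketch below. This is not from the thread itself: the bucket count of 10, the choice of $0 as the hashed field, and the use of piggybank's HashFNV UDF are all assumptions -- any deterministic function of the tuple would do.

```pig
-- Sketch: coalesce output into ~10 files without a full sort.
-- Assumes piggybank.jar is available on the classpath.
REGISTER piggybank.jar;

a = load '/logs/2011-03-31';
b = filter a by $1=='a' and $2=='b';

-- Deterministic bucketing: hashing a tuple field means a retried mapper
-- sends each record to the same reducer as the original attempt,
-- unlike rand(), which re-draws on re-execution.
grouped = group b by (org.apache.pig.piggybank.evaluation.string.HashFNV($0) % 10) parallel 10;

-- Flatten the bags back out so the stored records match relation b.
c = foreach grouped generate FLATTEN(b);
store c into '20110331-ab';
```

The number of part files in the output directory is then bounded by the reducer count (parallel 10), at the cost of one shuffle of the filtered data.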