That's a good call, thanks Mridul. Something reproducible like taking a hash of 
a tuple field is much better.
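
For what it's worth, here is a minimal sketch of what that could look like, assuming $0 can be cast to a chararray and using piggybank's HashFNV UDF (any deterministic hash UDF would do; the field, the modulus, and the parallelism of 10 are just illustrative):

register piggybank.jar;  -- path to piggybank.jar is environment-specific
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- deterministic key: a re-run map task emits the same keys as the original attempt,
-- unlike rand(); parallel 10 => at most 10 reducers, so at most 10 part files
g = group b by (org.apache.pig.piggybank.evaluation.string.HashFNV((chararray)$0) % 10) parallel 10;
c = foreach g generate flatten(b);
store c into '20110331-ab';

Compared to the order-by approach quoted below, this only shuffles on the key; it avoids the total sort (and the extra sampling job) that "order ... parallel n" kicks off.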

As for the concern about having to move all the data -- until HDFS allows 
multiple writers to a single file (not on the roadmap, afaik), there isn't a 
good way to have multiple mappers write to a single file. In fact, even if 
multiple writers were possible, you'd get into trouble if mappers failed...
 

-----Original Message-----
From: "Mridul Muralidharan" <mrid...@yahoo-inc.com>
To: "user@pig.apache.org" <user@pig.apache.org>
Cc: "Dmitriy Ryaboy" <dvrya...@gmail.com>; "Xiaomeng Wan" <shawn...@gmail.com>
Sent: 4/2/2011 6:57 AM
Subject: Re: store less files


Using rand() as the group key is, in general, a pretty bad idea in case of 
failures: a re-executed map task gets fresh rand() values, so reducers that 
fetch from different attempts see inconsistent output, and records can end up 
duplicated or dropped.


- Mridul

On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote:
> Don't order, that's expensive.
> Just group by rand(), specify parallelism on the group by, and store the
> result of "foreach grouped generate FLATTEN(name_of_original_relation);"
>
> On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan<shawn...@gmail.com>  wrote:
>
>> Hi Jameson,
>>
>> Do you mind adding something like this:
>>
>> c = order b by $0 parallel n;
>> store c into '20110331-ab';
>>
>> You can order on anything. It will add a reduce step and give you fewer files.
>>
>> Regards,
>> Shawn
>> On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li<hovlj...@gmail.com>  wrote:
>>> Hi,
>>>
>>> When I run the below pig codes:
>>> a = load '/logs/2011-03-31';
>>> b = filter a by $1=='a' and $2=='b';
>>> store b into '20110331-ab';
>>>
>>> It runs an M/R job that has thousands of maps, and then creates an output store
>>> director
[truncated by sender]
