RE: store less files

2011-04-02 Thread Dmitriy Ryaboy
That's a good call, thanks Mridul. Something reproducible like taking a hash of a tuple field is much better. As for the concern about having to move all the data -- until HDFS allows multiple writers to a single file (not on the roadmap afaik), there isn't a good way to have multiple mappers w

Re: store less files

2011-04-02 Thread Mridul Muralidharan
Using rand() as a group key is, in general, a pretty bad idea in case of failures.

- Mridul

On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote:
Don't order, that's expensive. Just group by rand(), specify parallelism on the group by, and store the result of "foreach grouped generate FLA
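
To illustrate the point about rand() versus a hash of a tuple field, here is a minimal Python sketch (not Pig, and not anything posted in the thread). It assumes the failure concern is that a re-executed task draws new random keys, so the same record can land in a different group on the retry, whereas a key derived from the record itself is stable. The helper names `bucket_by_rand` and `bucket_by_hash`, the sample records, and the use of CRC32 as the hash are all hypothetical choices made for the example.

```python
import random
import zlib

NUM_BUCKETS = 10  # stands in for the parallelism set on the group by


def bucket_by_rand(records, num_buckets, rng):
    """Assign each record a random bucket, like grouping by a random key."""
    return [rng.randrange(num_buckets) for _ in records]


def bucket_by_hash(records, num_buckets, field=0):
    """Assign each record a bucket from a hash of one tuple field.

    zlib.crc32 is used because it gives the same value in every process;
    Python's built-in hash() of a str is salted per process by default.
    """
    return [
        zlib.crc32(str(rec[field]).encode("utf-8")) % num_buckets
        for rec in records
    ]


# Hypothetical input: 1000 (user, value) tuples.
records = [("user%d" % i, i) for i in range(1000)]

# Two attempts of the same task. A retried task does not share random
# state with the failed attempt, modelled here with different seeds.
first_rand = bucket_by_rand(records, NUM_BUCKETS, random.Random(1))
retry_rand = bucket_by_rand(records, NUM_BUCKETS, random.Random(2))

first_hash = bucket_by_hash(records, NUM_BUCKETS)
retry_hash = bucket_by_hash(records, NUM_BUCKETS)

# Count how many records changed bucket between the attempt and the retry.
moved_rand = sum(a != b for a, b in zip(first_rand, retry_rand))
moved_hash = sum(a != b for a, b in zip(first_hash, retry_hash))

print("records that changed bucket on retry (rand):", moved_rand)
print("records that changed bucket on retry (hash):", moved_hash)
```

With the hash-based key the retry reproduces the first attempt exactly, so whichever attempt's output is used, every record ends up in the same group. With the random key most records move, which is the behaviour that becomes a problem if part of the first attempt's output has already been consumed.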