That's a good call, thanks Mridul. Something reproducible like taking a hash of
a tuple field is much better.
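
To make the difference concrete, here is a minimal Python sketch (not Pig, and
not taken from any real job) of why a key derived from a hash of a field is
reproducible across a task retry while a random key is not. The record values,
NUM_BUCKETS, and both helper functions are made up for illustration; MD5 is
used only because it returns the same value in every process.

```python
import hashlib
import random

def hash_bucket(field: str, num_buckets: int) -> int:
    """Deterministic key: the same field value always maps to the same bucket."""
    digest = hashlib.md5(field.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def rand_bucket(rng: random.Random, num_buckets: int) -> int:
    """Random key: depends on the generator state, not on the record."""
    return rng.randrange(num_buckets)

# Hypothetical input records and bucket count (assumptions for this sketch).
records = ["alice", "bob", "carol", "dave", "erin", "frank"]
NUM_BUCKETS = 4

# Simulate the original task attempt and a retry after a failure.
first_attempt = [hash_bucket(r, NUM_BUCKETS) for r in records]
retry_attempt = [hash_bucket(r, NUM_BUCKETS) for r in records]
hash_is_stable = first_attempt == retry_attempt  # always True

# With random keys, each attempt draws fresh values, so a retry will
# generally send the same records to different buckets.
rand_first = [rand_bucket(random.Random(), NUM_BUCKETS) for _ in records]
rand_retry = [rand_bucket(random.Random(), NUM_BUCKETS) for _ in records]

print("hash keys stable across retry:", hash_is_stable)
print("random keys, first attempt:", rand_first)
print("random keys, retry:        ", rand_retry)
```

NUM_BUCKETS here plays the role of the parallelism specified on the group by.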
As for the concern about having to move all the data -- until HDFS allows
multiple writers to a single file (not on the roadmap afaik), there isn't a
good way to have multiple mappers w
Using rand() as the group key is, in general, a pretty bad idea in case of
failures: a re-run task generates different keys than the original attempt did.
- Mridul
On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote:
Don't order, that's expensive.
Just group by rand(), specify parallelism on the group by, and store the
result of "foreach grouped generate FLA