Your mileage may vary.
If you are grouping records so that all the results are roughly equal in
size, then uniquing the keys before sampling is good. On the other hand, if
the grouping gives you larger data items for repeated keys, then giving
fewer of the big keys to some regionservers is good. You can als
> last key. This last step can probably best be done as a conventional script
> after you have knocked the data down to a small size.
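Not from the thread, but to make the sampling step concrete: a rough Java
sketch that reservoir-samples up to a fixed number of keys from stdin, with
an optional uniquing pass first. The class name, the --unique flag, and the
10,000 cap are all placeholders, not anything from this discussion.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.util.ArrayList;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Random;
  import java.util.Set;

  public class KeySampler {
    public static void main(String[] args) throws Exception {
      int max = 10000; // placeholder cap, per the 1,000-10,000 suggestion
      boolean unique = args.length > 0 && args[0].equals("--unique");
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      Set<String> seen = new HashSet<String>(); // assumes distinct keys fit in memory
      List<String> sample = new ArrayList<String>(max);
      Random rand = new Random();
      long count = 0;
      String line;
      while ((line = in.readLine()) != null) {
        if (unique && !seen.add(line)) continue; // drop repeated keys if uniquing
        count++;
        if (sample.size() < max) {
          sample.add(line); // fill the reservoir first
        } else {
          long j = (long) (rand.nextDouble() * count);
          if (j < max) sample.set((int) j, line); // standard reservoir replacement
        }
      }
      for (String key : sample) System.out.println(key);
    }
  }

The uniquing switch is exactly the trade-off described above: uniquing gives
every key equal weight, while sampling with repeats naturally hands fewer of
the fat keys to each region.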
Yeah, the split command comes to mind.
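For what it's worth, a rough sketch of pre-splitting at table-creation time
with the Java client API (HBaseAdmin.createTable accepts explicit split
keys); the table name, column family, and keys below are made up for
illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PreSplitTable {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical name
      desc.addFamily(new HColumnDescriptor("f"));
      byte[][] splitKeys = new byte[][] { // the every-n-th keys from the sample
          Bytes.toBytes("key0100"),
          Bytes.toBytes("key0200"),
          Bytes.toBytes("key0300"),
      };
      admin.createTable(desc, splitKeys); // one region boundary per split key
    }
  }

The shell's split command splits a region of an existing table after the
fact; creating the table with split keys avoids ever having a single hot
region.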
> Note that most of your joins can just go away since all you want are the
> keys.
Sure, but you need to make sure that you're sti
It should be pretty easy to down-sample the data to have no more than
1,000-10,000 keys. Sort those and take every n-th key, omitting the first and
last key. This last step can probably best be done as a conventional script
after you have knocked the data down to a small size.
Note that most of your joins can just go away since all you want are the
keys.
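A guess at what that conventional-script step could look like if written in
Java rather than shell; the input file of sorted sampled keys and the value
of n are placeholders:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.ArrayList;
  import java.util.List;

  public class EveryNthKey {
    public static void main(String[] args) throws Exception {
      int n = Integer.parseInt(args[1]); // stride, e.g. sampleSize / desiredRegions
      List<String> keys = new ArrayList<String>();
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      String line;
      while ((line = in.readLine()) != null) keys.add(line);
      in.close();
      // start at index n (skips the first key) and stop before the last key
      for (int i = n; i < keys.size() - 1; i += n) {
        System.out.println(keys.get(i));
      }
    }
  }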
The output is a text file. I'm sure I could write something using the
HDFS Java API to pull the first line of each file, but I'm looking for
an approach to extract these keys all via MR, if possible.
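Something like the following is presumably what that HDFS-API fallback would
look like: list the job's output directory and read only the first line of
each part file. The part- prefix and the directory argument are assumptions
on my end.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FirstLines {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
        if (!stat.getPath().getName().startsWith("part-")) continue;
        BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(stat.getPath())));
        String first = in.readLine(); // lowest key in this reducer's sorted output
        if (first != null) System.out.println(first);
        in.close();
      }
    }
  }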
On Tue, Mar 29, 2011 at 2:33 PM, Ted Yu wrote:
> I am not very familiar with Pig.
> Assuming the reducer output file is a SequenceFile, steps 2 and 3 can be
> automated.
I am not very familiar with Pig.
Assuming the reducer output file is a SequenceFile, steps 2 and 3 can be
automated.
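If the output really is a SequenceFile, a sketch of that automated version:
open each part file with SequenceFile.Reader and read just the first key.
That the keys are Text, and the part- naming, are assumptions here.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class FirstSequenceFileKeys {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
        if (!stat.getPath().getName().startsWith("part-")) continue;
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, stat.getPath(), conf);
        Text key = new Text(); // assuming Text keys
        if (reader.next(key)) { // reads only the first record's key
          System.out.println(key);
        }
        reader.close();
      }
    }
  }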
On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham wrote:
> I've been thinking about this topic lately so I'll fork from another
> discussion to ask if anyone has a good approach to determining keys
> for pre-splitting from a known dataset.
I've been thinking about this topic lately so I'll fork from another
discussion to ask if anyone has a good approach to determining keys
for pre-splitting from a known dataset. We have a key scenario similar
to what Ted describes below.
We periodically run MR jobs to transform and bulk load data f