Re: Tips on pre-splitting

2011-03-29 Thread Ted Dunning
Your mileage may vary. If you are grouping records so that all the results are equal size, then uniquing the keys before sampling is good. On the other hand, if you have larger data items for repeated keys due to the grouping, then giving fewer big keys to some regionservers is good. You can als
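One way to read the second case, where repeated keys carry larger data items, is to choose split points by accumulated size instead of by key count. The sketch below is only an illustration of that idea: the function name `weighted_split_keys`, the `(key, size)` input shape, and the midpoint rule for deciding where a boundary falls are all assumptions, not something described in the thread.

```python
def weighted_split_keys(key_sizes, num_regions):
    """Pick split keys so each region holds roughly equal total size.

    key_sizes: iterable of (key, size) pairs, one entry per unique key
    (hypothetical input shape; size could be bytes or record count).
    Returns at most num_regions - 1 split keys in sorted order.
    """
    items = sorted(key_sizes)
    total = sum(size for _, size in items)
    if num_regions < 2 or total <= 0:
        return []
    target = total / num_regions  # ideal size of a single region
    splits = []
    running = 0  # total size of all keys before the current one
    for key, size in items:
        boundary = target * (len(splits) + 1)
        # Start a new region at this key once the midpoint of its data
        # reaches the next ideal boundary.
        if len(splits) < num_regions - 1 and running + size / 2 >= boundary:
            splits.append(key)
        running += size
    return splits
```

With equal sizes this reduces to evenly spaced keys. A single very large key simply ends up in a region with fewer neighbours, which is roughly the "fewer big keys to some regionservers" outcome.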

Re: Tips on pre-splitting

2011-03-29 Thread Bill Graham
> last key. This last can probably best be done as a conventional script
> after you have knocked down the data to small size.

Yeah, the split command comes to mind.

> Note that most of your joins can just go away since all you want are the
> keys.

Sure, but you need to make sure that you're sti

Re: Tips on pre-splitting

2011-03-29 Thread Ted Dunning
It should be pretty easy to down-sample the data to have no more than 1000-10,000 keys. Sort those and take every n-th key omitting the first and last key. This last can probably best be done as a conventional script after you have knocked down the data to small size. Note that most of your join
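The down-sample, sort, and take-every-n-th-key steps could be written as a small conventional script along these lines. This is a minimal sketch under stated assumptions: the function name `choose_split_keys`, the reservoir-sampling step used for down-sampling, and the default sample cap of 10,000 are illustrative choices, not something specified in the thread.

```python
import random


def choose_split_keys(keys, num_regions, max_sample=10000, seed=0):
    """Pick up to num_regions - 1 split keys from an iterable of row keys.

    Hypothetical helper: down-samples with a reservoir, sorts the sample,
    then takes evenly spaced keys, leaving out the first and last ones.
    """
    rng = random.Random(seed)
    sample = []
    for i, key in enumerate(keys):
        if i < max_sample:
            sample.append(key)
        else:
            # Reservoir sampling: keep each later key with probability
            # max_sample / (i + 1), replacing a random earlier pick.
            j = rng.randint(0, i)
            if j < max_sample:
                sample[j] = key
    sample.sort()
    if num_regions < 2 or len(sample) <= num_regions:
        return []
    step = len(sample) / num_regions
    # Indices step, 2*step, ..., (num_regions - 1)*step skip both ends.
    picked = [sample[int(i * step)] for i in range(1, num_regions)]
    # Repeated keys in the sample can yield the same split twice; drop repeats.
    return list(dict.fromkeys(picked))
```

For example, 1,000 distinct keys `row0000` to `row0999` and `num_regions=4` give the three split keys `row0250`, `row0500`, and `row0750`.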

Re: Tips on pre-splitting

2011-03-29 Thread Bill Graham
The output is a text file. I'm sure I could write something using the HDFS Java API to pull the first line of each file, but I'm looking for an approach to extract these keys all via MR, if possible.

On Tue, Mar 29, 2011 at 2:33 PM, Ted Yu wrote:
> I am not very familiar with Pig.
> Assuming red

Re: Tips on pre-splitting

2011-03-29 Thread Ted Yu
I am not very familiar with Pig. Assuming reducer output file is SequenceFile, steps 2 and 3 can be automated.

On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham wrote:
> I've been thinking about this topic lately so I'll fork from another
> discussion to ask if anyone has a good approach to determini

Tips on pre-splitting

2011-03-29 Thread Bill Graham
I've been thinking about this topic lately so I'll fork from another discussion to ask if anyone has a good approach to determining keys for pre-splitting from a known dataset. We have a key scenario similar to what Ted describes below. We periodically run MR jobs to transform and bulk load data f