Hello,

I am running a global sort (on Pigmix input data, 600 GB) based on TotalOrderPartitioner. The best practice according to the literature is to sample the data using RandomSampler. The query succeeds but takes a very long time (7 hours) because there is only one reducer (which defeats the point of using the above classes :) ).

I am trying to figure out what forces the number of reducers to be *one*, as I set it to *400*. I looked into the documentation and the code of RandomSampler, and there is a requirement which says:
// Set the path to the SequenceFile storing the sorted partition keyset. It must be the case that for R reduces, there are R-1 keys in the SequenceFile.

Therefore I sampled as follows:

InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<Text, Text>(0.9, 399, 444);

Looking into my _partition file, I can see there is only one partition, which explains the one reducer:

SEQ org.apache.hadoop.io.Text!org.apache.hadoop.io.NullWritable

I am wondering how the partition file can contain only one sample, though I asked for 399 samples above.

Thanks for the help!!
Keren

--
Keren Ouaknine
www.kereno.com