So here are my questions:
(1) Is there a JobConf hint to limit the number of records in kviter?
I can (and have) made a fix to my code that processes the values in a
combiner step in batches (i.e., takes N at a time, processes them, and
repeats), but was wondering if I could just set an option.

Approximately and indirectly, yes. You can limit the size of the buffer that holds serialized records in memory (io.sort.mb) and the percentage of that space reserved for record metadata (io.sort.record.percent, IIRC). Together these can be used to limit the number of records in each spill, though you may also need to disable the combiner during the merge, where you may run into the same problem.
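As a rough illustration of how those two settings bound the record count, here is a minimal sketch of the arithmetic. It assumes 16 bytes of accounting data per record (as I recall from the 0.20-era MapTask collector) and a spill threshold like io.sort.spill.percent, which is not mentioned above. The class and method names are hypothetical, and the result is an estimate rather than a guarantee, so check the source for your version.

```java
public class SpillEstimate {
    // Assumption: each record costs 16 bytes of accounting data in the sort
    // buffer. This may differ between Hadoop versions.
    static final int BYTES_PER_RECORD_META = 16;

    // Estimates how many records fit in the metadata portion of the buffer
    // before a spill is triggered.
    static long maxRecordsPerSpill(int ioSortMb, double recordPercent, double spillPercent) {
        long bufferBytes = (long) ioSortMb << 20;               // io.sort.mb, in bytes
        long metaBytes = (long) (bufferBytes * recordPercent);  // share reserved for metadata
        long recordSlots = metaBytes / BYTES_PER_RECORD_META;   // records that share can track
        return (long) (recordSlots * spillPercent);             // spill starts at this fill level
    }

    public static void main(String[] args) {
        // Example values only: 100 MB buffer, 5% metadata, spill at 80% full.
        System.out.println(maxRecordsPerSpill(100, 0.05, 0.80));
        // Shrinking the metadata share lowers the bound.
        System.out.println(maxRecordsPerSpill(100, 0.01, 0.80));
    }
}
```

Lowering either io.sort.mb or io.sort.record.percent shrinks the estimate, at the cost of more spills and therefore more merge work.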

You're almost certainly better off designing your combiner to scale well (as you have), since you'll hit this in the reduce, too.
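The batching approach described in the question can be sketched as below. This is not the poster's actual code: the class, method, and parameter names are hypothetical, and it uses plain java.util types so that it stands alone. In a real combiner, keep in mind that Hadoop reuses the value object between iterator calls, so each value would need to be copied before it is added to a batch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BatchedValues {
    // Walks the iterator once, handing the values to the handler in lists of
    // at most batchSize elements. Returns the number of batches handled.
    static <T> int forEachBatch(Iterator<T> values, int batchSize, Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (values.hasNext()) {
            batch.add(values.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batch = new ArrayList<>(batchSize); // start a fresh batch
                batches++;
            }
        }
        if (!batch.isEmpty()) { // final, possibly short, batch
            handler.accept(batch);
            batches++;
        }
        return batches;
    }
}
```

Memory use is then bounded by the batch size rather than by the number of values for a key, which is what lets the same logic survive in the reduce as well.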

Since this occurred in the MapContext, changing the number of reducers
won't help.
(2) How does changing the number of reducers help at all? I have 7
machines, so I feel 11 (a prime close to 7, why a prime?) is good
enough (some machines are 16GB others 32GB)

Your combiner will look at all the records for a partition and only those records in a partition. If your partitioner distributes your records evenly in a particular spill, then increasing the total number of partitions will decrease the number of records your combiner considers in each call. For most partitioners, whether the number of reducers is prime should be irrelevant. -C
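To make the partition argument concrete, here is a small standalone sketch. It uses the same formula as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numPartitions, but the class name, method names, and the synthetic integer keys are assumptions for illustration only. It counts how many records land in the fullest partition for a given partition count.

```java
public class PartitionSpread {
    // Same formula as the default hash partitioner.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Partitions the synthetic keys 0..numRecords-1 and returns the record
    // count of the fullest partition.
    static int largestPartition(int numRecords, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int i = 0; i < numRecords; i++) {
            counts[partitionFor(Integer.valueOf(i), numPartitions)]++;
        }
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // Compare a few partition counts, prime and non-prime.
        for (int p : new int[] {7, 11, 12, 22}) {
            System.out.println(p + " partitions -> " + largestPartition(1000, p));
        }
    }
}
```

With evenly spread keys like these, the fullest partition shrinks roughly in proportion to the partition count, and 12 partitions does no worse than 11. Skewed keys or a custom partitioner can behave differently.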
