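Your idea in the first paragraph is correct. To speed things up, you can also use a Combiner. For example, to compute the sum, set the combiner to the same class as your reducer. Variance needs more care: a combiner that outputs (xi - mu)^2 only works if mu is already known, e.g. from a previous job. To do it in one pass, have the mapper and combiner emit partial aggregates per column (count, sum of xi, sum of xi^2); the reducer adds them up and computes variance = sum(xi^2)/n - mu^2, and can take the sqrt if you want the standard deviation.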
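One way to make the variance combinable in a single pass is to carry (count, sum, sum of squares) per column and merge those triples. Below is a minimal sketch in plain Java, with no Hadoop dependencies. The class name ColumnStats and its methods are hypothetical, made up for illustration. In a real job the same merge logic would sit inside both the combiner and the reducer, and the triple would need to be wrapped in a Writable.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Hadoop): a partial aggregate for one column.
public class ColumnStats {
    public final long count;
    public final double sum;
    public final double sumSq;

    public ColumnStats(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    // Map side: wrap a single cell value as a partial aggregate.
    public static ColumnStats of(double x) {
        return new ColumnStats(1, x, x * x);
    }

    // Used by both combiner and reducer. Merging is associative and
    // commutative, so it is safe to apply it zero, one, or many times.
    public ColumnStats merge(ColumnStats other) {
        return new ColumnStats(count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    public static ColumnStats mergeAll(List<ColumnStats> parts) {
        ColumnStats acc = new ColumnStats(0, 0.0, 0.0);
        for (ColumnStats p : parts) {
            acc = acc.merge(p);
        }
        return acc;
    }

    public double mean() {
        return sum / count;
    }

    // Population variance: E[x^2] - (E[x])^2.
    public double variance() {
        double m = mean();
        return sumSq / count - m * m;
    }

    public double stdDev() {
        return Math.sqrt(variance());
    }

    public static void main(String[] args) {
        // Two simulated map tasks, each seeing part of the same column.
        ColumnStats a = mergeAll(Arrays.asList(of(2), of(4), of(4), of(4)));
        ColumnStats b = mergeAll(Arrays.asList(of(5), of(5), of(7), of(9)));
        // Reducer side: merge the combiner outputs and finish the computation.
        ColumnStats total = a.merge(b);
        System.out.println(total.sum + " " + total.variance() + " " + total.stdDev());
        // prints "40.0 4.0 2.0"
    }
}
```

The sum-of-squares formula can lose precision when the values are large compared to their spread. If that matters for your data, a two-job approach (mean first, then squared deviations) is the safer choice.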
Your second assumption, that the number of reducers equals the number of variables, is not right. The number of reducers is a job setting (for example job.setNumReduceTasks(n)), not something derived from the number of keys. With the default partitioner, all values for one key still go to a single reducer, so with only a few variables most of the parallelism comes from the map side, which is where the combiner helps.

- Rohit Kelkar

On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfang...@gmail.com> wrote:
> Hi,
>
> I have a spreadsheet where each column contains values for one
> variable. and I need to calculate sum, variance, etc for each column.
> For my understanding, mapper and reducer work for <key, value> pair,
> can anyone kindly enlighten me how to abstract this problem?
>
> Maybe for the mapper, let it read each line, set variable name/number
> as "key", and corresponding value as "value".
> Then when all pairs with the same "key" (i.e. they belong to same
> variable) be passed to a reducer, reducer can do the calculation, and
> output to file.
> is this idea correct? can anyone kindly give some comment?
>
> Besides, in this method, the number of reducers will be determined by
> the number of variables I have.
> What happen if variable number is limited, and for each variable, the
> number of entries is far much bigger than the total number of
> variables, then execution time for each reducer can be comparatively
> long.
> Any way to make use of more hardware resource, and create more
> reducers to run in parallel?
>
> Best regards,
> Xin