Hi Rohit,

Thank you for your reply. Regarding the second assumption, could you explain a bit further why the number of reducers is not equal to the number of variables?
Thank you.

On Tue, Apr 3, 2012 at 12:50 PM, Rohit Kelkar <rohitkel...@gmail.com> wrote:
> Your idea in first paragraph is correct. To speed up things you can
> also explore the possibility of using a Combiner. For ex. for
> computing the sum set the combiner to be the same class as your
> reducer. For calculating variance write a combiner class that would
> output (xi - mu)^2 and in the reducer code you could take the sqrt.
>
> Your second assumption that number of reducers = number of variables
> is not right.
>
> - Rohit Kelkar
>
> On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfang...@gmail.com> wrote:
>> Hi,
>>
>> I have a spreadsheet where each column contains values for one
>> variable. and I need to calculate sum, variance, etc for each column.
>> For my understanding, mapper and reducer work for <key, value> pair,
>> can anyone kindly enlighten me how to abstract this problem?
>>
>> Maybe for the mapper, let it read each line, set variable name/number
>> as "key", and corresponding value as "value".
>> Then when all pairs with the same "key" (i.e. they belong to same
>> variable) be passed to a reducer, reducer can do the calculation, and
>> output to file.
>> is this idea correct? can anyone kindly give some comment?
>>
>> Besides, in this method, the number of reducers will be determined by
>> the number of variables I have.
>> What happen if variable number is limited, and for each variable, the
>> number of entries is far much bigger than the total number of
>> variables, then execution time for each reducer can be comparatively
>> long.
>> Any way to make use of more hardware resource, and create more
>> reducers to run in parallel?
>>
>> Best regards,
>> Xin
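
For reference, the column-as-key idea discussed in this thread can be sketched in plain Python. This is only a simulation of the map, combine and reduce steps, not Hadoop code, and every function name in it (map_row, merge, combine, finish, run) is hypothetical. It also swaps in a different variance technique from the one suggested above: instead of emitting (xi - mu)^2, which would need the mean in advance, each partial result carries (count, sum, sum of squares), so the combiner and the reducer can share the same merge step and the population variance is computed at the end as E[x^2] - mean^2.

```python
from collections import defaultdict


def map_row(row):
    """Mapper: for each cell emit (column index, (count, sum, sum of squares))."""
    for col, x in enumerate(row):
        yield col, (1, x, x * x)


def merge(partials):
    """Add partial (count, sum, sum of squares) triples component-wise.

    The operation is associative and commutative, so it can be applied
    both in the combiner and in the reducer.
    """
    n = s = sq = 0
    for pn, ps, psq in partials:
        n += pn
        s += ps
        sq += psq
    return n, s, sq


def combine(pairs):
    """Combiner: pre-aggregate the output of one mapper, per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, merge(values)) for key, values in grouped.items()]


def finish(partials):
    """Reducer: merge all partials for one column and derive the statistics."""
    n, s, sq = merge(partials)
    mean = s / n
    variance = sq / n - mean * mean  # population variance, E[x^2] - mean^2
    return {"sum": s, "mean": mean, "variance": variance}


def run(splits):
    """Simulate a job: each split is the list of rows seen by one mapper."""
    shuffled = defaultdict(list)
    for split in splits:
        mapped = [pair for row in split for pair in map_row(row)]
        for key, partial in combine(mapped):
            shuffled[key].append(partial)
    return {key: finish(partials) for key, partials in shuffled.items()}


if __name__ == "__main__":
    # Two mappers, two rows each, two columns per row.
    stats = run([[[1.0, 10.0], [2.0, 20.0]], [[3.0, 30.0], [4.0, 40.0]]])
    for column, values in sorted(stats.items()):
        print(column, values)
```

With the combiner in place, each mapper sends only one small triple per column to the reduce side, so the reduce work per column stays small even when a column has very many entries. On the second question: in Hadoop the number of reduce tasks is a job setting, and keys are assigned to reduce tasks by a partitioner (by default a hash of the key), so a single reduce task may handle several columns. The sum-of-squares formula can lose precision when the mean is large relative to the spread, so treat this as a sketch rather than a numerically robust implementation.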