Unless you really need exact averages and so on, I would suggest simply
sampling the input and computing your statistics from the sample. It will
be a lot faster, and you won't have to deal with underflow or overflow.
If your sample is reasonably large, your results will be pretty close to
the true values.

Miles

2008/11/12 Joel Welling <[EMAIL PROTECTED]>:
> Amar, isn't there a problem with your method in that it gets a small
> result by subtracting very large numbers? Given a million inputs, won't
> A and B be so much larger than the standard deviation that there aren't
> enough bits left in the floating point number to represent it?
>
> I just thought I should mention that, before this thread goes in an
> archive somewhere and some student looks it up.
>
> -Joel
>
> On Wed, 2008-11-12 at 12:32 +0530, Amar Kamat wrote:
>> some speed wrote:
>> > Thanks for the response. What I am trying to do is find the average
>> > and then the standard deviation for a very large set (say a million) of
>> > numbers. The result would be used in further calculations.
>> > I have got the average from the first map-reduce chain. Now I need to read
>> > this average as well as the set of numbers to calculate the standard
>> > deviation. So one file would have the input set and the other "resultant"
>> > file would have just the average.
>> > Please do tell me in case there is a better way of doing things than what I
>> > am doing. Any input/suggestion is appreciated. :)
>> >
>> std_dev^2 = sum_i((Xi - Xa) ^ 2) / N; where Xa is the avg.
>> Why don't you use the formula to compute it in one MR job?
>> std_dev^2 = (sum_i(Xi ^ 2) - N * (Xa ^ 2) ) / N;
>>           = (A - N*(avg^2))/N
>>
>> For this your map would look like
>> map (key, val) : output.collect(key^2, key); // imagine your input as
>> (k,v) = (Xi, null)
>> Reduce should simply sum over the keys to find out 'sum_i(Xi ^ 2)' and
>> sum over the values to find out 'sum_i(Xi)', from which Xa = sum_i(Xi) / N.
>> You could use the close() api to finally dump these 2 values to a file.
>>
>> For example :
>> input : 1,2,3,4
>> Say input is split in 2 groups [1,2] and [3,4]
>> Now there will be 2 maps with output as follows
>> map1 output : (1,1) (4,2)
>> map2 output : (9,3) (16,4)
>>
>> Reducer will maintain the sum over all keys and all values
>> A = sum(key i.e input squared) = 1 + 4 + 9 + 16 = 30
>> B = sum(values i.e input) = 1 + 2 + 3 + 4 = 10
>>
>> With A and B you can compute the standard deviation offline.
>> So avg = B / N = 10/4 = 2.5
>> Hence the std deviation would be
>> sqrt( (A - N * avg^2) / N) = sqrt ((30 - 4*6.25)/4) = 1.11803399
>>
>> Using the main formula the answer is 1.11803399
>> Amar
>> >
>> > On Mon, Nov 10, 2008 at 4:22 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
>> >
>> >> Amar Kamat wrote:
>> >>
>> >>> some speed wrote:
>> >>>
>> >>>> I was wondering if it was possible to read the input for a map function
>> >>>> from 2 different files:
>> >>>> 1st file ---> user-input file from a particular location(path)
>> >>>>
>> >>> Is the input/user file sorted? If yes then you can use "map-side join"
>> >>> for
>> >> performance reasons. See org.apache.hadoop.mapred.join for more details.
>> >>
>> >>> 2nd file ---> A resultant file (has just one <key,value> pair) from a
>> >>>> previous MapReduce job. (I am implementing a chain MapReduce function)
>> >>>>
>> >>> Can you explain in more detail the contents of 2nd file?
>> >>>
>> >>>> Now, for every <key,value> pair in the user-input file, I would like to
>> >>>> use the same <key,value> pair from the 2nd file for some calculations.
>> >>>>
>> >>> Can you explain this in more detail? Can you give some abstracted example
>> >>>
>> >> of how file1 and file2 look like and what operation/processing you want to
>> >> do?
>> >>
>> >>> I guess you might need to do some kind of join on the 2 files. Look at
>> >>> contrib/data_join for more details.
>> >>> Amar
>> >>>
>> >>>> Is it possible for me to do so? Can someone guide me in the right
>> >>>> direction please?
>> >>>>
>> >>>> Thanks!
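
To make Joel's concern about subtracting very large numbers concrete, here is a minimal Python sketch. It is not from the thread: it swaps the (A, B) sums for a per-split summary of (count, mean, M2), built with Welford's update and merged with the pairwise formula of Chan, Golub and LeVeque. The function names (partial_stats, merge_stats, population_std, naive_variance) are made up for illustration, and the "map" and "reduce" steps are plain Python standing in for a real MapReduce job.

```python
import math
from functools import reduce


def partial_stats(values):
    """'Map' side: summarize one split as (count, mean, M2) via Welford's update.

    M2 is the running sum of squared deviations from the current mean.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge_stats(a, b):
    """'Reduce' side: combine two partial summaries (Chan et al. pairwise merge)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def population_std(stats):
    """Population standard deviation sqrt(M2 / N), matching the thread's formula."""
    n, _, m2 = stats
    return math.sqrt(m2 / n)


def naive_variance(values):
    """The one-pass formula from the thread: (A - N * avg^2) / N."""
    n = len(values)
    a = sum(x * x for x in values)  # A = sum of squares
    b = sum(values)                 # B = plain sum
    avg = b / n
    return (a - n * avg * avg) / n


# The thread's example: input 1,2,3,4 split as [1,2] and [3,4].
splits = [[1, 2], [3, 4]]
total = reduce(merge_stats, (partial_stats(s) for s in splits))
print(total[1], population_std(total))

# Same spread, but shifted by 1e9 so the squares are around 1e18.
shifted = [[1e9 + 1, 1e9 + 2], [1e9 + 3, 1e9 + 4]]
shifted_total = reduce(merge_stats, (partial_stats(s) for s in shifted))
flat = [x for s in shifted for x in s]
print(population_std(shifted_total), naive_variance(flat))
```

On the small example both approaches agree with Amar's 1.11803399. On the shifted data the true variance is still 1.25, but the naive formula subtracts two values near 4e18 and the result is dominated by rounding error, while the merged summary keeps the deviations small before squaring them.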
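
Miles's sampling suggestion can be sketched in a few lines of Python as well. This is only an illustration under assumptions: the thread does not say how to draw the sample, so the sketch uses reservoir sampling (Algorithm R), which picks k items uniformly from a stream of unknown length in one pass. The function name reservoir_sample, the sample size of 2000, and the synthetic input are all hypothetical.

```python
import random
import statistics


def reservoir_sample(stream, k, rng):
    """Return k items drawn uniformly at random from an iterable (Algorithm R)."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(x)
        else:
            # Keep item i with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = x
    return sample


rng = random.Random(42)          # fixed seed so the run is repeatable
data = range(100_000)            # stand-in for the large input set
sample = reservoir_sample(data, 2000, rng)

est_mean = statistics.mean(sample)
est_std = statistics.stdev(sample)   # sample standard deviation (N - 1 divisor)
print(len(sample), est_mean, est_std)
```

The estimates carry sampling error that shrinks roughly with the square root of the sample size, so whether this is good enough depends on how the results are used in the later calculations.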