On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni <dar...@ontrenet.com> wrote:
> Theoretically, O(n).
>
> All other variables being equal across all nodes, it
> should...mmmmm.....reduce to n.
>
> The part that really can't be measured is the cost of Hadoop's
> bookkeeping chores as the data set grows, since some things in Hadoop
> involve synchronous/serial behavior.
>
> On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:
>
>> A previous post to core-user mentioned some formula to determine job
>> time. I was wondering if anyone out there is trying to tackle
>> designing a formula that can calculate the job run time of a
>> map/reduce program. Obviously there are many variables here, including
>> but not limited to disk speed, network speed, processor speed, input
>> data, many constants, data skew, map complexity, reduce complexity, #
>> of nodes......
>>
>> As an intellectual challenge, has anyone started trying to write a
>> formula that takes all these factors into account and tries to
>> actually predict a job time in minutes/hours?
Understood, Big-O notation is really not what I am looking for. Given that all the variables are held equal, a Hadoop job on a finite set of data should run for a finite time. Parts of the process run serially and parts run in parallel, but there must be a way to express how long a job actually takes (although admittedly it would be very involved to work out).
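Something like the sketch below is the kind of back-of-the-envelope model I have in mind. To be clear, every number in it is a made-up input that you would have to measure or guess yourself (per-node disk and network throughput, per-byte map/reduce CPU cost, a skew factor, a fixed framework overhead); none of it is something Hadoop exposes, and the phase breakdown is only an assumption about how a job spends its time:

    // A rough sketch of a job-time estimate. All parameters are hypothetical
    // inputs the user supplies; nothing here is queried from Hadoop itself.
    public class JobTimeEstimate {

        public static double estimateSeconds(long inputBytes,
                                             int nodes,
                                             double diskBytesPerSec,   // sustained disk throughput per node
                                             double netBytesPerSec,    // usable network bandwidth per node
                                             double mapSecPerByte,     // CPU cost of map() per input byte
                                             double reduceSecPerByte,  // CPU cost of reduce() per shuffled byte
                                             double shuffleRatio,      // map output bytes / map input bytes
                                             double skewFactor,        // >= 1.0; penalty for the most loaded node
                                             double overheadSec) {     // fixed scheduling/bookkeeping cost
            double bytesPerNode = (double) inputBytes / nodes;

            // Map phase: read the split from disk plus run the map function.
            double mapPhase = bytesPerNode / diskBytesPerSec
                            + bytesPerNode * mapSecPerByte;

            // Shuffle phase: move intermediate data across the network.
            double shuffleBytesPerNode = bytesPerNode * shuffleRatio;
            double shufflePhase = shuffleBytesPerNode / netBytesPerSec;

            // Reduce phase: run the reduce function and write output to disk.
            double reducePhase = shuffleBytesPerNode * reduceSecPerByte
                               + shuffleBytesPerNode / diskBytesPerSec;

            // Skew stretches the parallel phases; the fixed overhead stands in
            // for the serial bookkeeping Darren mentioned.
            return (mapPhase + shufflePhase + reducePhase) * skewFactor + overheadSec;
        }

        public static void main(String[] args) {
            // Example: 1 TB of input on 10 nodes with invented hardware numbers.
            double seconds = estimateSeconds(
                    1_000_000_000_000L, // 1 TB input
                    10,                 // nodes
                    100e6,              // 100 MB/s disk
                    100e6,              // ~1 GbE network
                    5e-9,               // 5 ns of map CPU per byte
                    10e-9,              // 10 ns of reduce CPU per byte
                    0.5,                // map output is half the input
                    1.3,                // 30% skew penalty
                    60.0);              // 1 minute of fixed overhead
            System.out.printf("Estimated job time: %.1f minutes%n", seconds / 60.0);
        }
    }

Obviously this ignores a lot (multiple waves of map tasks, speculative execution, JVM startup, the combiner, and the growing bookkeeping cost as the cluster and data set scale), which is exactly the hard part of getting from a sketch like this to a formula that predicts real job times.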