Hey Bharath,

There has been some work in the research community on predicting the runtime of Hadoop MapReduce, Pig, and Hive jobs, though the approaches are not always cost-based. At the University of Washington, they've worked on progress indicators for Pig; see ftp://ftp.cs.washington.edu/tr/2009/07/UW-CSE-09-07-01.PDF. At the University of California, Berkeley, they've worked on predicting multiple metrics about query progress, first in NeoView and subsequently in Hadoop; see http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf for the NeoView results.
Some preliminary design work has been done in the Hive project on collecting statistics for Hive tables; see https://issues.apache.org/jira/browse/HIVE-33. Any contributions to this work would be much appreciated!

Regards,
Jeff

On Fri, Aug 28, 2009 at 7:37 PM, indoos <[email protected]> wrote:
>
> Hi,
> My suggestion would be that we should not compel ourselves to compare
> databases with Hadoop. However, here is something that is probably not
> even close to what you need, but it might be helpful:
>
> 1. Number of nodes - these are the parameters to look at:
>    - the average time taken by a single map and a single reduce task
>      (available as part of the job history analytics),
>    - total input size vs. block size. Let's take an example: a 64 GB input
>      file with a 64 MB block size would ideally require ~1000 maps. The
>      more of those 1000 maps you want to run in parallel, the more nodes
>      you need. A 10-node cluster running 10 maps per node would have to
>      run ~10 sequential waves of maps :-(
>    - ultimately it is the time-vs-cost trade-off that decides the number
>      of nodes. For this example, if a map takes at least 2 minutes, the
>      minimum job time would be about 2*10 = 20 minutes. Less time means
>      more nodes.
>    - The number of jobs you decide to run at the same time also affects
>      the number of nodes. Every individual task (map or reduce) effectively
>      runs in a sequential mode, waiting in the queue for the currently
>      executing map/reduce tasks to finish. (Of course, there is some
>      prioritization support, but that does not help to finish everything
>      in parallel.)
> 2. RAM - a general rule of thumb is 1 GB of RAM each for the NameNode,
>    JobTracker, and Secondary NameNode on the master side. On the slave
>    side, 1 GB each for the TaskTracker and DataNode leaves not much for
>    real computing on a commodity 8 GB machine: the remaining 5-6 GB can
>    be used for map/reduce tasks. So with our example of running 10 maps
>    per node, each map gets at most a 400-500 MB heap. Anything beyond
>    that requires either fewer concurrent maps or more RAM.
> 3. Network speed - Hadoop recommends (I think I read it somewhere;
>    apologies if not) at least 1 Gb/s (gigabit) networking for the heavy
>    data transfers. My experience with 100 Mb/s networks, even in a dev
>    environment, has been disastrous.
> 4. Hard disk - again a rule of thumb: only about 1/4 of the raw capacity
>    is effectively available. So on a 4 TB disk, only ~1 TB can be used for
>    real data, with ~2 TB taken by replication (3 is the usual replication
>    factor) and ~1 TB for temporary/scratch space.
>
> Regards,
> Sanjay
>
>
> bharath vissapragada-2 wrote:
> >
> > Hi all,
> >
> > Is there any general cost model that can be used to estimate the run
> > time of a MapReduce program (similar to page I/O rates and selectivity
> > factors in an RDBMS) in terms of configuration aspects such as number
> > of nodes, page I/Os, etc.?
> >
>
> --
> View this message in context:
> http://www.nabble.com/cost-model-for-MR-programs-tp25127531p25199508.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
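For what it's worth, the arithmetic in point 1 of Sanjay's reply can be captured in a few lines. Here is a minimal back-of-envelope sketch in Java; the input size, block size, slots per node, and per-map time below are the illustrative figures from the thread, not values measured on a real cluster.

// Back-of-envelope estimate of map count, map waves, and minimum job time,
// following the heuristic in point 1 above. All inputs are the example
// figures from the thread (assumptions, not measurements).
public class MapTimeEstimate {
    public static void main(String[] args) {
        long inputBytes      = 64L * 1024 * 1024 * 1024; // 64 GB input
        long blockBytes      = 64L * 1024 * 1024;        // 64 MB HDFS block size
        int  nodes           = 10;                       // cluster size
        int  mapSlotsPerNode = 10;                       // concurrent maps per node
        double minutesPerMap = 2.0;                      // average map task time

        long maps  = (inputBytes + blockBytes - 1) / blockBytes; // ~1000 maps
        long slots = (long) nodes * mapSlotsPerNode;             // 100 maps at a time
        long waves = (maps + slots - 1) / slots;                 // ~10 sequential waves

        System.out.printf("maps=%d, concurrent slots=%d, waves=%d%n",
                maps, slots, waves);
        System.out.printf("minimum map-phase time ~= %.0f minutes%n",
                waves * minutesPerMap);
    }
}

Run as a plain Java program this prints 1024 maps, 100 slots, 11 waves and ~22 minutes, in line with the rounded ~1000 maps / ~10 waves / ~20 minutes quoted above.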

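Similarly, the thumb rules in points 2 and 4 (per-task heap on a slave node, and effective disk capacity under replication) reduce to simple division. A sketch under the same caveat: the 8 GB machine, the 1 GB daemon footprints, and the replication factor of 3 are the example numbers from the thread.

// Rough slave-node RAM budget (point 2) and usable-disk estimate (point 4).
// The daemon footprints, machine size, and replication factor are the
// example figures quoted in the thread, not tuned recommendations.
public class NodeBudgetEstimate {
    public static void main(String[] args) {
        // RAM budget on a slave node (point 2)
        double totalRamGb      = 8.0;  // commodity machine
        double taskTrackerGb   = 1.0;  // TaskTracker daemon
        double dataNodeGb      = 1.0;  // DataNode daemon
        int    concurrentTasks = 10;   // maps running at once, as in the example

        double heapPerTaskMb = (totalRamGb - taskTrackerGb - dataNodeGb)
                               / concurrentTasks * 1024;
        // ~600 MB on paper; the thread quotes 400-500 MB after leaving
        // headroom for the OS and for reduce tasks.
        System.out.printf("heap per map task ~= %.0f MB%n", heapPerTaskMb);

        // Usable disk on a slave node (point 4)
        double rawDiskTb   = 4.0; // raw capacity
        int    replication = 3;   // HDFS replication factor
        double tempTb      = 1.0; // scratch/temp space
        double usableTb    = (rawDiskTb - tempTb) / replication; // ~1 TB
        System.out.printf("effective data capacity ~= %.1f TB of %.1f TB raw%n",
                usableTb, rawDiskTb);
    }
}

The disk figure works out to the same 1 TB of real data per 4 TB of raw disk quoted in point 4 above.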