@Sanjay Thanks for your reply; I too was thinking along the same lines.
@Jeff I found those links very useful; I am working on similar things. Thanks for the reply! :)

On Sat, Aug 29, 2009 at 11:44 AM, Jeff Hammerbacher <[email protected]> wrote:

> Hey Bharath,
> There has been some work in the research community on predicting the
> runtime of Hadoop MapReduce, Pig, and Hive jobs, though the approaches are
> not always cost-based. At the University of Washington, they've worked on
> progress indicators for Pig; see
> ftp://ftp.cs.washington.edu/tr/2009/07/UW-CSE-09-07-01.PDF. At the
> University of California, Berkeley, they've worked on predicting multiple
> metrics about query progress, first in NeoView, and subsequently in
> Hadoop; see http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf
> for the NeoView results.
>
> Some preliminary design work has been done in the Hive project for
> collecting statistics on Hive tables. See
> https://issues.apache.org/jira/browse/HIVE-33. Any contributions to this
> work would be much appreciated!
>
> Regards,
> Jeff
>
> On Fri, Aug 28, 2009 at 7:37 PM, indoos <[email protected]> wrote:
>
> > Hi,
> > My suggestion would be that we should not be compelling ourselves to
> > compare databases with Hadoop.
> > However, here is something probably not even close to what you may
> > require, but it might be helpful:
> > 1. Number of nodes - these are the parameters to look for:
> > - average time taken by a single Map and Reduce task (available as part
> > of the history analytics),
> > - max input file size vs. block size. Let's take an example: a 6 GB
> > input file with a 64 MB block size would ideally require ~100 Maps. The
> > more of these ~100 Maps you want to run in parallel, the more nodes you
> > need. A 10-node cluster running 10 Maps at a time would have to run ~10
> > waves in a kind of sequential mode :-(
> > - ultimately it is the time vs. cost factor that decides the number of
> > nodes.
> > So for this example, if a Map takes at least 2 minutes, the ~minimum
> > time would be 2*10 = 20 minutes. Less time would mean more nodes.
> > - The number of jobs that you might decide to run at the same time
> > would also affect the number of nodes. Effectively, every individual
> > job task (map/reduce) runs in a sequential kind of mode, waiting in the
> > queue for the existing/executing map/reduce block to finish. (Of
> > course, we have some prioritization support - this does not, however,
> > help to finish everything in parallel.)
> > 2. RAM - a general rule of thumb is 1 GB RAM each for the NameNode,
> > JobTracker, and Secondary NameNode on the master side. On the slave
> > side, 1 GB RAM each for the TaskTracker and DataNode, which leaves
> > practically not much for good computing on a commodity 8 GB machine.
> > The remaining 5-6 GB can then be used for Map/Reduce tasks. So with our
> > example of running 10 Maps, we would have at most a Map using 400-500
> > MB of heap. Anything beyond this would require either the number of
> > Maps to be reduced or the RAM to be increased.
> > 3. Network speed - Hadoop recommends (I think I did read it
> > somewhere - apologies if otherwise) using at least 1 Gb/s networks for
> > the heavy data transfer. My experiences with 100 Mb/s, even in a dev
> > environment, have been disastrous.
> > 4. Hard disk - again a rule of thumb: only 1/4 of the raw capacity is
> > effectively available. So given a 4 TB hard disk, effectively only 1 TB
> > can be used for real data, with 2 TB used for replication (3 is the
> > ideal replication factor) and 1 TB for temp usage.
> > Regards,
> > Sanjay
> >
> > bharath vissapragada-2 wrote:
> > >
> > > Hi all,
> > >
> > > Is there any general cost model that can be used to guess the run
> > > time of a program (similar to page IO/s, selectivity factors in an
> > > RDBMS) in terms of any config aspects such as number of nodes, page
> > > IO/s, etc.?
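Sanjay's node-count arithmetic above can be sketched as a quick back-of-envelope calculation. The function name and figures here are illustrative assumptions, not any Hadoop API:

```python
import math

def estimate_job_time(input_gb, block_mb, nodes, map_slots_per_node,
                      minutes_per_map):
    """Return (num_maps, waves, estimated_minutes) for the map phase."""
    num_maps = math.ceil(input_gb * 1024 / block_mb)  # roughly one map per block
    slots = nodes * map_slots_per_node                # maps running in parallel
    waves = math.ceil(num_maps / slots)               # sequential waves of maps
    return num_maps, waves, waves * minutes_per_map

# 6 GB input, 64 MB blocks -> ~100 maps; 10 nodes running 10 maps at a
# time -> 10 waves; at 2 minutes per map, ~20 minutes minimum.
print(estimate_job_time(6, 64, 10, 1, 2))  # (96, 10, 20)
```

Halving the estimated time means roughly doubling the slots, which is the time-vs-cost trade-off the email describes.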
> > --
> > View this message in context:
> > http://www.nabble.com/cost-model-for-MR-programs-tp25127531p25199508.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
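In the same spirit, the RAM and disk rules of thumb in points 2 and 4 reduce to simple arithmetic. The overhead figure (1 GB each for TaskTracker and DataNode plus ~1 GB for the OS) and the 1/4-for-temp reserve are assumptions taken from the email, not fixed Hadoop defaults:

```python
def heap_per_map_mb(node_ram_gb, overhead_gb, map_slots):
    """MB of heap left per concurrent map after daemons and OS take their share."""
    return (node_ram_gb - overhead_gb) * 1024 // map_slots

def usable_storage_tb(raw_tb, replication=3):
    """TB of real data: reserve ~1/4 for temp space, divide the rest by replication."""
    temp_tb = raw_tb / 4
    return (raw_tb - temp_tb) / replication

# 8 GB slave, ~3 GB assumed for TaskTracker + DataNode + OS, 10 map slots:
print(heap_per_map_mb(8, 3, 10))  # 512 MB per map, i.e. the 400-500 MB ballpark
# 4 TB raw disk with 3x replication:
print(usable_storage_tb(4))       # 1.0 TB of real data
```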
