@Sanjay Thanks for your reply; I too was thinking along the same lines.
@Jeff I found those links very useful; I am working on similar things. Thanks for the reply! :)

On Sat, Aug 29, 2009 at 11:44 AM, Jeff Hammerbacher <[email protected]> wrote:

> Hey Bharath,
> There has been some work in the research community on predicting the
> runtime of Hadoop MapReduce, Pig, and Hive jobs, though the approaches are
> not always cost-based. At the University of Washington, they've worked on
> progress indicators for Pig; see
> ftp://ftp.cs.washington.edu/tr/2009/07/UW-CSE-09-07-01.PDF. At the
> University of California, Berkeley, they've worked on predicting multiple
> metrics about query progress, first in NeoView, and subsequently in
> Hadoop; see http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf
> for the NeoView results.
>
> Some preliminary design work has been done in the Hive project for
> collecting statistics on Hive tables. See
> https://issues.apache.org/jira/browse/HIVE-33. Any contributions to this
> work would be much appreciated!
>
> Regards,
> Jeff
>
> On Fri, Aug 28, 2009 at 7:37 PM, indoos <[email protected]> wrote:
>
> > Hi,
> > My suggestion would be that we should not be compelling ourselves to
> > compare databases with Hadoop.
> > However, here is something probably not even close to what you may
> > require, but it might be helpful:
> > 1. Number of nodes - these are the parameters to look for:
> > - average time taken by a single Map and Reduce task (available as part
> > of the history analytics),
> > - max input file size vs. block size. Let's take an example: a 6 GB
> > input file with a 64 MB block size would ideally require ~100 Maps. The
> > more of these ~100 Maps you want to run in parallel, the more nodes you
> > need. A 10-node cluster running 10 Maps at a time would have to run ~10
> > waves in a kind of sequential mode :-(
> > - ultimately it is the time vs. cost factor that decides the number of
> > nodes.
> > So for this example, if a Map takes at least 2 minutes, the ~minimum
> > time would be 2*10 = 20 minutes. Less time would mean more nodes.
> > - The number of jobs that you might decide to run at the same time
> > would also affect the number of nodes. Effectively, every individual
> > job task (map/reduce) runs in a sequential kind of mode, waiting in the
> > queue for the existing/executing map/reduce block to finish. (Of
> > course, we have some prioritization support - this does not, however,
> > help to finish everything in parallel.)
> > 2. RAM - a general rule of thumb is 1 GB RAM each for the NameNode,
> > JobTracker, and Secondary NameNode on the master side. On the slave
> > side, 1 GB RAM each for the TaskTracker and DataNode, which leaves
> > practically not much for good computing on a commodity 8 GB machine.
> > The remaining 5-6 GB can then be used for Map/Reduce tasks. So with our
> > example of running 10 Maps, we would have at most a Map using 400-500
> > MB of heap. Anything beyond this would require either the number of
> > Maps to be reduced or the RAM to be increased.
> > 3. Network speed - Hadoop recommends (I think I did read it
> > somewhere - apologies if otherwise) using at least 1 Gb/s networks for
> > the heavy data transfer. My experiences with 100 Mb/s, even in a dev
> > environment, have been disastrous.
> > 4. Hard disk - again a rule of thumb: only 1/4 of the raw capacity is
> > effectively available. So given a 4 TB hard disk, effectively only 1 TB
> > can be used for real data, with 2 TB used for replication (3 is the
> > ideal replication factor) and 1 TB for temp usage.
> > Regards,
> > Sanjay
> >
> > bharath vissapragada-2 wrote:
> > >
> > > Hi all,
> > >
> > > Is there any general cost model that can be used to guess the run
> > > time of a program (similar to page IO/s, selectivity factors in an
> > > RDBMS) in terms of any config aspects such as number of nodes, page
> > > IO/s, etc.?
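Sanjay's node-count arithmetic above can be sketched as a quick back-of-envelope calculation. The function name and figures here are illustrative assumptions, not any Hadoop API:

```python
import math

def estimate_job_time(input_gb, block_mb, nodes, map_slots_per_node,
                      minutes_per_map):
    """Return (num_maps, waves, estimated_minutes) for the map phase."""
    num_maps = math.ceil(input_gb * 1024 / block_mb)  # roughly one map per block
    slots = nodes * map_slots_per_node                # maps running in parallel
    waves = math.ceil(num_maps / slots)               # sequential waves of maps
    return num_maps, waves, waves * minutes_per_map

# 6 GB input, 64 MB blocks -> ~100 maps; 10 nodes running 10 maps at a
# time -> 10 waves; at 2 minutes per map, ~20 minutes minimum.
print(estimate_job_time(6, 64, 10, 1, 2))  # (96, 10, 20)
```

Halving the estimated time means roughly doubling the slots, which is the time-vs-cost trade-off the email describes.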
> > --
> > View this message in context:
> > http://www.nabble.com/cost-model-for-MR-programs-tp25127531p25199508.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
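In the same spirit, the RAM and disk rules of thumb in points 2 and 4 reduce to simple arithmetic. The overhead figure (1 GB each for TaskTracker and DataNode plus ~1 GB for the OS) and the 1/4-for-temp reserve are assumptions taken from the email, not fixed Hadoop defaults:

```python
def heap_per_map_mb(node_ram_gb, overhead_gb, map_slots):
    """MB of heap left per concurrent map after daemons and OS take their share."""
    return (node_ram_gb - overhead_gb) * 1024 // map_slots

def usable_storage_tb(raw_tb, replication=3):
    """TB of real data: reserve ~1/4 for temp space, divide the rest by replication."""
    temp_tb = raw_tb / 4
    return (raw_tb - temp_tb) / replication

# 8 GB slave, ~3 GB assumed for TaskTracker + DataNode + OS, 10 map slots:
print(heap_per_map_mb(8, 3, 10))  # 512 MB per map, i.e. the 400-500 MB ballpark
# 4 TB raw disk with 3x replication:
print(usable_storage_tb(4))       # 1.0 TB of real data
```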
