Re: Sailfish

Sriram Rao Thu, 10 May 2012 23:02:06 -0700

Srivas,

Sailfish is builds upon record append (a feature not present in HDFS).


The software that is currently released is based on Hadoop-0.20.2.  You use
the Sailfish version of Hadoop-0.20.2, KFS for the intermediate data, and
then HDFS (or KFS) for storing the job/input.  Since the changes are all in
the handling of map output/reduce input, it is transparent to existing jobs.

What is being proposed below is to bolt all the starting/stopping of the
related deamons into YARN as a first step.  There are other approaches that
are possible, which have a similar effect.

Hope this helps.

Sriram


On Thu, May 10, 2012 at 10:50 PM, M. C. Srivas <mcsri...@gmail.com> wrote:

> Sriram,   Sailfish depends on append. I just noticed the HDFS disabled
> append. How does one use this with Hadoop?
>
>
> On Wed, May 9, 2012 at 9:00 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com
> > wrote:
>
> > Hi Sriram,
> >
> > >> The I-file concept could possibly be implemented here in a fairly self
> > contained way. One
> > >> could even colocate/embed a KFS filesystem with such an alternate
> > >> shuffle, like how MR task temporary space is usually colocated with
> > >> HDFS storage.
> >
> > >  Exactly.
> >
> > >> Does this seem reasonable in any way?
> >
> > > Great. Where do go from here?  How do we get a colloborative effort
> > going?
> >
> >
> > Sounds like a JIRA issue should be opened, the approach briefly
> described,
> > and the first implementation attempt made.  Then iterate.
> >
> > I look forward to seeing this! :)
> >
> > Otis
> > --
> >
> > Performance Monitoring for Solr / ElasticSearch / HBase -
> > http://sematext.com/spm
> >
> >
> >
> > >________________________________
> > > From: Sriram Rao <srirams...@gmail.com>
> > >To: common-dev@hadoop.apache.org
> > >Sent: Tuesday, May 8, 2012 6:48 PM
> > >Subject: Re: Sailfish
> > >
> > >Dear Andy,
> > >
> > >> From: Andrew Purtell <apurt...@apache.org>
> > >> ...
> > >
> > >> Do you intend this to be a joint project with the Hadoop community or
> > >> a technology competitor?
> > >
> > >As I had said in my email, we are looking for folks to colloborate
> > >with us to help get us integrated with Hadoop.  So, to be explicitly
> > >clear, we are intending for this to be a joint project with the
> > >community.
> > >
> > >> Regrettably, KFS is not a "drop in replacement" for HDFS.
> > >> Hypothetically: I have several petabytes of data in an existing HDFS
> > >> deployment, which is the norm, and a continuous MapReduce workflow.
> > >> How do you propose I, practically, migrate to something like Sailfish
> > >> without a major capital expenditure and/or downtime and/or data loss?
> > >
> > >Well, we are not asking for KFS to replace HDFS.  One path you could
> > >take is to experiment with Sailfish---use KFS just for the
> > >intermediate data and HDFS for everything else.  There is no major
> > >capex :).  While you get comfy with pushing intermediate data into a
> > >DFS, we get the ideas added to HDFS.  This simplifies deployment
> > >considerations.
> > >
> > >> However, can the Sailfish I-files implementation be plugged in as an
> > >> alternate Shuffle implementation in MRv2 (see MAPREDUCE-3060 and
> > >> MAPREDUCE-4049),
> > >
> > >This'd be great!
> > >
> > >> with necessary additional plumbing for dynamic
> > >> adjustment of reduce task population? And the workbuilder could be
> > >> part of an alternate MapReduce Application Manager?
> > >
> > >It should be part of the AM.  (Currently, with our implementation in
> > >Hadoop-0.20.2, the workbuilder serves the role of an AM).
> > >
> > >> The I-file concept could possibly be implemented here in a fairly self
> > contained way. One
> > >> could even colocate/embed a KFS filesystem with such an alternate
> > >> shuffle, like how MR task temporary space is usually colocated with
> > >> HDFS storage.
> > >
> > >Exactly.
> > >
> > >> Does this seem reasonable in any way?
> > >
> > >Great. Where do go from here?  How do we get a colloborative effort
> going?
> > >
> > >Best,
> > >
> > >Sriram
> > >
> > >>>  From: Sriram Rao <srirams...@gmail.com>
> > >>> To: common-dev@hadoop.apache.org
> > >>> Sent: Tuesday, May 8, 2012 10:32 AM
> > >>> Subject: Project announcement: Sailfish (also, looking for
> > colloborators)
> > >>>
> > >>> Hi,
> > >>>
> > >>> I'd like to announce the release of a new open source project,
> > Sailfish.
> > >>>
> > >>> http://code.google.com/p/sailfish/
> > >>>
> > >>> Sailfish tries to improve Hadoop-performance, particularly for
> > large-jobs
> > >>> which process TB's of data and run for hours.  In building Sailfish,
> we
> > >>> modify how map-output is handled and transported from map->reduce.
> > >>>
> > >>> The project pages provide more information about the project.
> > >>>
> > >>> We are looking for colloborators who can help get some of the ideas
> > into
> > >>> Apache Hadoop. A possible step forward could be to make "shuffle"
> > phase of
> > >>> Hadoop pluggable.
> > >>>
> > >>> If you are interested in working with us, please get in touch with
> me.
> > >>>
> > >>> Sriram
> > >>
> > >
> > >
> > >
> > >--
> > >Best regards,
> > >
> > >   - Andy
> > >
> > >Problems worthy of attack prove their worth by hitting back. - Piet
> > >Hein (via Tom White)
> > >
> > >
> > >
> >
>

Re: Sailfish

Reply via email to