Srivas, Sailfish builds upon record append (a feature not present in HDFS).
The software that is currently released is based on Hadoop-0.20.2. To use it, you run the Sailfish version of Hadoop-0.20.2, with KFS for the intermediate data and HDFS (or KFS) for storing the job/input. Since the changes are all in the handling of map output/reduce input, they are transparent to existing jobs.

What is being proposed below is, as a first step, to bolt the starting/stopping of the related daemons into YARN. Other approaches with a similar effect are possible.

Hope this helps.

Sriram

On Thu, May 10, 2012 at 10:50 PM, M. C. Srivas <mcsri...@gmail.com> wrote:

> Sriram, Sailfish depends on append. I just noticed that HDFS disabled
> append. How does one use this with Hadoop?
>
> On Wed, May 9, 2012 at 9:00 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
>
> > Hi Sriram,
> >
> > >> The I-file concept could possibly be implemented here in a fairly
> > >> self-contained way. One could even colocate/embed a KFS filesystem
> > >> with such an alternate shuffle, like how MR task temporary space is
> > >> usually colocated with HDFS storage.
> > >
> > > Exactly.
> > >
> > >> Does this seem reasonable in any way?
> > >
> > > Great. Where do we go from here? How do we get a collaborative effort going?
> >
> > Sounds like a JIRA issue should be opened, the approach briefly described,
> > and the first implementation attempt made. Then iterate.
> >
> > I look forward to seeing this! :)
> >
> > Otis
> > --
> > Performance Monitoring for Solr / ElasticSearch / HBase -
> > http://sematext.com/spm
> >
> > >________________________________
> > > From: Sriram Rao <srirams...@gmail.com>
> > >To: common-dev@hadoop.apache.org
> > >Sent: Tuesday, May 8, 2012 6:48 PM
> > >Subject: Re: Sailfish
> > >
> > >Dear Andy,
> > >
> > >> From: Andrew Purtell <apurt...@apache.org>
> > >> ...
> > >
> > >> Do you intend this to be a joint project with the Hadoop community or
> > >> a technology competitor?
> > >
> > >As I had said in my email, we are looking for folks to collaborate
> > >with us to help get us integrated with Hadoop. So, to be explicitly
> > >clear, we are intending for this to be a joint project with the
> > >community.
> > >
> > >> Regrettably, KFS is not a "drop in replacement" for HDFS.
> > >> Hypothetically: I have several petabytes of data in an existing HDFS
> > >> deployment, which is the norm, and a continuous MapReduce workflow.
> > >> How do you propose I, practically, migrate to something like Sailfish
> > >> without a major capital expenditure and/or downtime and/or data loss?
> > >
> > >Well, we are not asking for KFS to replace HDFS. One path you could
> > >take is to experiment with Sailfish---use KFS just for the
> > >intermediate data and HDFS for everything else. There is no major
> > >capex :). While you get comfy with pushing intermediate data into a
> > >DFS, we get the ideas added to HDFS. This simplifies deployment
> > >considerations.
> > >
> > >> However, can the Sailfish I-files implementation be plugged in as an
> > >> alternate Shuffle implementation in MRv2 (see MAPREDUCE-3060 and
> > >> MAPREDUCE-4049),
> > >
> > >This'd be great!
> > >
> > >> with necessary additional plumbing for dynamic
> > >> adjustment of reduce task population? And the workbuilder could be
> > >> part of an alternate MapReduce Application Manager?
> > >
> > >It should be part of the AM. (Currently, with our implementation in
> > >Hadoop-0.20.2, the workbuilder serves the role of an AM).
> > >
> > >> The I-file concept could possibly be implemented here in a fairly
> > >> self-contained way. One could even colocate/embed a KFS filesystem
> > >> with such an alternate shuffle, like how MR task temporary space is
> > >> usually colocated with HDFS storage.
> > >
> > >Exactly.
> > >
> > >> Does this seem reasonable in any way?
> > >
> > >Great. Where do we go from here? How do we get a collaborative effort going?
> > >
> > >Best,
> > >
> > >Sriram
> > >
> > >>> From: Sriram Rao <srirams...@gmail.com>
> > >>> To: common-dev@hadoop.apache.org
> > >>> Sent: Tuesday, May 8, 2012 10:32 AM
> > >>> Subject: Project announcement: Sailfish (also, looking for colloborators)
> > >>>
> > >>> Hi,
> > >>>
> > >>> I'd like to announce the release of a new open source project, Sailfish.
> > >>>
> > >>> http://code.google.com/p/sailfish/
> > >>>
> > >>> Sailfish tries to improve Hadoop performance, particularly for large jobs
> > >>> which process TBs of data and run for hours. In building Sailfish, we
> > >>> modify how map output is handled and transported from map->reduce.
> > >>>
> > >>> The project pages provide more information about the project.
> > >>>
> > >>> We are looking for collaborators who can help get some of the ideas into
> > >>> Apache Hadoop. A possible step forward could be to make the "shuffle"
> > >>> phase of Hadoop pluggable.
> > >>>
> > >>> If you are interested in working with us, please get in touch with me.
> > >>>
> > >>> Sriram
> > >>
> > >
> > >--
> > >Best regards,
> > >
> > > - Andy
> > >
> > >Problems worthy of attack prove their worth by hitting back. - Piet
> > >Hein (via Tom White)