Re: [DISCUSS] Fault tolerant BSP job

Suraj Menon Wed, 09 Apr 2014 08:57:12 -0700

I don't like my patch in HAMA-639 myself, even though I believe it satisfies
all the mentioned requirements. Using the superstep chaining API as
implemented in the patch is too complicated. A superstep here is like a
transformation function you define on an RDD in Spark. If you look into the
FT design of Spark, on failure they rerun the operations on the RDD to get
back to the current state. This is similar to what we have in mind using
checkpointing. The challenge is in getting the same messages replayed to a
newly spawned task on checkpointed data. If you don't use the Superstep (or
any other abstraction representing a function), you cannot resume processing
from the line of code where the failure occurred. (Java does not support
goto line number.)
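
To make the point concrete, here is a minimal, self-contained sketch. It is
not the HAMA-639 code and does not use the Hama API: the class
SuperstepReplaySketch, the Checkpoint type, and the failBefore parameter are
all hypothetical, and message replay (the hard part mentioned above) is left
out. It only shows why a list of functions can be resumed by index while a
single bsp() body cannot.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongUnaryOperator;

public class SuperstepReplaySketch {

  /** Hypothetical checkpoint: the state at a barrier plus the next superstep to run. */
  static final class Checkpoint {
    final int nextSuperstep;
    final long state;

    Checkpoint(int nextSuperstep, long state) {
      this.nextSuperstep = nextSuperstep;
      this.state = state;
    }
  }

  private Checkpoint last;

  SuperstepReplaySketch(Checkpoint start) {
    this.last = start;
  }

  Checkpoint lastCheckpoint() {
    return last;
  }

  /**
   * Runs the supersteps starting at the checkpointed index.
   * failBefore simulates a task crash just before that superstep (-1 = no crash).
   */
  long run(List<LongUnaryOperator> supersteps, int failBefore) {
    long state = last.state;
    for (int i = last.nextSuperstep; i < supersteps.size(); i++) {
      if (i == failBefore) {
        throw new IllegalStateException("simulated failure before superstep " + i);
      }
      state = supersteps.get(i).applyAsLong(state);
      // A real job would hit the barrier sync here; this sketch only records a checkpoint.
      last = new Checkpoint(i + 1, state);
    }
    return state;
  }

  /** Runs three toy supersteps, recovering once if the simulated failure fires. */
  public static long demo(int failBefore) {
    List<LongUnaryOperator> steps =
        Arrays.<LongUnaryOperator>asList(s -> s + 1, s -> s * 10, s -> s - 3);
    SuperstepReplaySketch task = new SuperstepReplaySketch(new Checkpoint(0, 0L));
    try {
      return task.run(steps, failBefore);
    } catch (IllegalStateException e) {
      // A newly spawned task picks up from the last checkpoint, not from superstep 0.
      SuperstepReplaySketch respawned = new SuperstepReplaySketch(task.lastCheckpoint());
      return respawned.run(steps, -1);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo(-1)); // no failure
    System.out.println(demo(2));  // failure before the last superstep, then recovery
  }
}
```

Both calls in main print 7: the recovered run re-enters the loop at the
checkpointed index and reaches the same final state as the failure-free run.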


-Suraj


On Wed, Apr 9, 2014 at 7:29 AM, Edward J. Yoon <[email protected]> wrote:

> I just found this: https://issues.apache.org/jira/browse/HAMA-503 and
> HAMA-639.
>
> Do you still think the superstep API is essential for checkpoint/recovery?
> If not, we can drop it. I don't think it's a good idea.
>
> On Wed, Apr 9, 2014 at 7:43 PM, Chia-Hung Lin <[email protected]> wrote:
> > Not very sure if we are on the same page. And sorry, I am not very
> > familiar with the Superstep implementation.
> >
> > I assume that the traditional BSP model means the original BSP interface,
> > where there is a bsp function and the user can freely call peer.sync()
> > and other methods:
> >
> > .... bsp(BSPPeer ... peer) {
> >     // whatever computation
> >     peer.sync();
> > }
> >
> > And the superstep style is with Superstep abstract class.
> >
> > If this is the case, SuperstepBSP.java already calls sync(), as shown
> > below, outside each Superstep.compute(). So even though
> > SuperstepPiEstimator doesn't call the sync() method itself, a barrier
> > sync will be executed, because each Superstep is viewed as a superstep
> > in the original BSP definition.
> >
> >   @Override
> >   public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException,
> >       SyncException, InterruptedException {
> >     for (int index = startSuperstep; index < supersteps.length; index++) {
> >       Superstep<K1, V1, K2, V2, M> superstep = supersteps[index];
> >       superstep.compute(peer);
> >       if (superstep.haltComputation(peer)) {
> >         break;
> >       }
> >       peer.sync();
> >       startSuperstep = 0;
> >     }
> >   }
> >
> > Within the Superstep.compute(), if sync is called again, I would think
> > that another barrier sync will be executed.
> >
> > SuperstepBSP.java
> >
> > for(...) {
> >   superstep.compute() -> { // in compute method
> >     ...
> >     peer.sync()
> >   }
> >   ...
> >   peer.sync()
> > }
> >
> > IIRC each call to sync may trigger the checkpoint (no recovery) method,
> > which serializes messages to HDFS.
> >
> > For SerializePrinting, the following code snippet may be moved
> >
> > for (String otherPeer : bspPeer.getAllPeerNames()) {
> >     bspPeer.send(otherPeer, new IntegerMessage(bspPeer.getPeerName(), i));
> > }
> >
> > to Superstep.compute()
> >
> > And the outer for loop is what is programmed in SuperstepBSP.java
> >
> > for (int i = 0; i < NUM_SUPERSTEPS; i++) {
> >     // code that should be moved to Superstep.compute()
> > }
> > bspPeer.sync();
> >
> >
> >
> > On 9 April 2014 16:17, Edward J. Yoon <[email protected]> wrote:
> >> As you can see here[1], the sync() method is never called, and the
> >> classes of all supersteps need to be declared within the Job
> >> configuration. Therefore, I thought it's similar to the Pregel style on
> >> the BSP model. It's quite different from the legacy model in my eyes.
> >>
> >> According to HAMA-505, the superstep API seems to be used for FT job
> >> processing (I didn't read it closely yet). Right? Here, I have some
> >> questions. What happens if I call the sync() method within the compute()
> >> method? In this case, does the framework guarantee checkpoint/recovery?
> >> And how can I implement http://wiki.apache.org/hama/SerializePrinting
> >> using the superstep API?
> >>
> >>> What's the difference between pure BSP and FT BSP? Any concrete example?
> >>
> >> I meant the traditional BSP programming model.
> >>
> >> 1. http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
> >>
> >> On Wed, Apr 9, 2014 at 4:25 PM, Chia-Hung Lin <[email protected]> wrote:
> >>> Sorry, I don't catch the point.
> >>>
> >>> What's the difference between pure BSP and FT BSP? Any concrete example?
> >>>
> >>>
> >>> On 9 April 2014 08:29, Edward J. Yoon <[email protected]> wrote:
> >>>> In my eyes, SuperstepPiEstimator[1] looks like a totally new
> >>>> programming model, very similar to Pregel.
> >>>>
> >>>> I personally would like to suggest that we provide both the pure BSP
> >>>> and the fault tolerant BSP model, instead of replacing one with the
> >>>> other.
> >>>>
> >>>> 1. http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
> >>>>
> >>>> --
> >>>> Edward J. Yoon (@eddieyoon)
> >>>> Chief Executive Officer
> >>>> DataSayer, Inc.
> >>
> >>
> >>
> >> --
> >> Edward J. Yoon (@eddieyoon)
> >> CEO at DataSayer Co., Ltd.
>
>
>
> --
> Edward J. Yoon (@eddieyoon)
> Chief Executive Officer
> DataSayer Co., Ltd.
>
