Re: Review Request 23989: Improve checkpointing

Maja Kabiljo Wed, 30 Jul 2014 21:20:08 -0700


> On July 30, 2014, 9:08 p.m., Maja Kabiljo wrote:
> > Did you test automatic restart of jobs after checkpoint? 
> > 
> > Maybe I'm missing something, because it looks to me that after checkpoint 
> > we finish as if halt was called, meaning we store output. Is this the case?
> 
> Sergey Edunov wrote:
>     I tested it with PageRank. After checkpoint we do finish. What happens 
> next depends on how you configure GiraphJobRetryChecker: if it is configured 
> to restart, the job will restart from the checkpoint. It is not configured 
> this way in the open source version (and I'm not sure if it should be).


I agree that it shouldn't be the default, but since you are adding it as an 
option, shouldn't you test it out?

Please answer the second question about storing output.
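
To make the restart question concrete, here is a minimal sketch of the two 
behaviors being discussed: halt after the checkpoint, or restart from it. 
Everything in it is an assumption for illustration: RetryPolicy is a 
hypothetical stand-in, not the actual GiraphJobRetryChecker interface, and 
"restart from the superstep after the checkpoint" is a simplification of 
whatever the real restart path does.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointRestartSketch {
    /** Hypothetical stand-in for a retry checker; NOT the real Giraph interface. */
    public interface RetryPolicy {
        boolean shouldRestartCheckpoint();
    }

    /**
     * Simulates a job that is asked to checkpoint and stop at one superstep.
     * Returns a log of what happened, so both outcomes can be compared.
     */
    public static List<String> run(RetryPolicy policy, int checkpointAt, int totalSupersteps) {
        List<String> log = new ArrayList<>();
        int start = 0;
        boolean checkpointRequested = true;
        while (true) {
            boolean stopped = false;
            for (int s = start; s < totalSupersteps; s++) {
                if (checkpointRequested && s == checkpointAt) {
                    log.add("checkpoint@" + s);
                    checkpointRequested = false; // the request is consumed once
                    start = s + 1;               // assumed: resume after the checkpointed superstep
                    stopped = true;
                    break;
                }
            }
            if (!stopped) {
                log.add("finished");
                return log;
            }
            if (!policy.shouldRestartCheckpoint()) {
                log.add("halted");
                return log;
            }
            log.add("restart@" + start);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(() -> false, 2, 5)); // default-like behavior: stop after checkpoint
        System.out.println(run(() -> true, 2, 5));  // opt-in behavior: restart from checkpoint
    }
}
```

A test of the restart option would presumably need to cover the second path, 
where the job continues past the checkpointed superstep and runs to completion.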


> On July 30, 2014, 9:08 p.m., Maja Kabiljo wrote:
> > giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java, lines 
> > 165-169
> > <https://reviews.apache.org/r/23989/diff/2/?file=643671#file643671line165>
> >
> >     Why do we need two separate flags here?
> 
> Sergey Edunov wrote:
>     I can't trust the flag set by the user, as it may be set after some 
> workers have passed the checkpointing stage while others haven't. So the 
> logic is: 
>     1) User sets _checkpointAndStop
>     2) Master checks if _checkpointAndStop is set and sets 
> _forceCheckpointAndStopInternal if so. This happens before the barrier.
>     3) After the barrier, workers check if _forceCheckpointAndStopInternal is 
> set, and do checkpoint and stop if that's the case

Oh I see, because in the normal case each worker currently decides 
independently whether we should store a checkpoint (unlike, for example, 
halting, where the master says what's going to happen). Should we change this?
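
For anyone following the thread, the two-flag logic described above can be 
sketched roughly as follows. This is only an illustration under stated 
assumptions: the two ZooKeeper nodes are modeled as plain boolean fields, the 
barrier is implied by call order, and the class and method names are 
hypothetical rather than taken from the patch.

```java
public class TwoFlagCheckpointSketch {
    // Stand-in for the _checkpointAndStop node: the user may set it at any time.
    private volatile boolean userCheckpointAndStop;
    // Stand-in for _forceCheckpointAndStopInternal: only the master sets it,
    // and only before the barrier, so every worker sees the same value.
    private volatile boolean forceCheckpointAndStopInternal;

    /** User side: request a checkpoint-and-stop (hypothetical method name). */
    public void userRequestsCheckpoint() {
        userCheckpointAndStop = true;
    }

    /** Master side, before the barrier: snapshot the user flag into the internal one. */
    public void masterBeforeBarrier() {
        if (userCheckpointAndStop) {
            forceCheckpointAndStopInternal = true;
        }
    }

    /** Worker side, after the barrier: consult only the internal flag. */
    public boolean workerShouldCheckpointAndStop() {
        return forceCheckpointAndStopInternal;
    }

    public static void main(String[] args) {
        TwoFlagCheckpointSketch flags = new TwoFlagCheckpointSketch();

        flags.masterBeforeBarrier();                                  // superstep N: no request yet
        boolean earlyWorker = flags.workerShouldCheckpointAndStop();
        flags.userRequestsCheckpoint();                               // request arrives mid-superstep
        boolean lateWorker = flags.workerShouldCheckpointAndStop();
        System.out.println(earlyWorker + " " + lateWorker);           // workers agree with each other

        flags.masterBeforeBarrier();                                  // superstep N+1: master picks it up
        System.out.println(flags.workerShouldCheckpointAndStop());
    }
}
```

The point of the sketch is that workers never read the user flag directly, so a 
request arriving mid-superstep cannot split them into two groups.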


> On July 30, 2014, 9:08 p.m., Maja Kabiljo wrote:
> > giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java, 
> > line 1624
> > <https://reviews.apache.org/r/23989/diff/2/?file=643683#file643683line1624>
> >
> >     Why do we ignore death here?
> 
> Sergey Edunov wrote:
>     Right, it should be conditioned on checkpointStatus. If it is set to 
> CHECKPOINT_AND_HALT, workers will die after checkpointing; we don't want 
> the master to die in this case.

Hm, you mean they'll die when they finish storing, and you want the master to 
stay alive until the last one is done, or what?


> On July 30, 2014, 9:08 p.m., Maja Kabiljo wrote:
> > giraph-core/src/main/java/org/apache/giraph/conf/GiraphConstants.java, line 
> > 1149
> > <https://reviews.apache.org/r/23989/diff/2/?file=643677#file643677line1149>
> >
> >     What is this?
> 
> Sergey Edunov wrote:
>     It used to be a class name, but not all versions of Hadoop support class 
> names. All of them support extensions, though.

And what are the valid options here? Maybe update the comment.


- Maja


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23989/#review49163
-----------------------------------------------------------


On July 28, 2014, 5:25 p.m., Sergey Edunov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23989/
> -----------------------------------------------------------
> 
> (Updated July 28, 2014, 5:25 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Repository: giraph-git
> 
> 
> Description
> -------
> 
> We need to address some issues with checkpointing:
> 1) worker2worker messages are not saved
> 2) BspServiceWorker does not compile under hadoop_0.23 profile
> 3) it would be nice to be able to manually checkpoint and stop any job at any 
> point in time.
> 
> Changes:
> 
> 1) worker2worker messages fixed by serializing currentworkertoworker messages 
> (it is a list of writables, so I had to write class information as well)
> 2) Compilation issues fixed
> 3) The way you can trigger checkpointing now is by creating a 
> /_checkpointAndStop node in ZooKeeper (the same way _haltComputation works). 
> After that, the behavior of the job is determined by the registered 
> GiraphJobRetryChecker. By default, the job gets checkpointed at the end of 
> the current superstep and halted. You can override this behavior by making 
> shouldRestartCheckpoint() return true; in this case the job is restarted 
> immediately after being checkpointed.
> 
> 
> Diffs
> -----
> 
>   giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 02577b9 
>   giraph-core/src/main/java/org/apache/giraph/bsp/CentralizedService.java 
> ff3e427 
>   
> giraph-core/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java 
> e5b7cf3 
>   giraph-core/src/main/java/org/apache/giraph/bsp/CheckpointStatus.java 
> PRE-CREATION 
>   giraph-core/src/main/java/org/apache/giraph/bsp/SuperstepState.java c384fbf 
>   giraph-core/src/main/java/org/apache/giraph/comm/ServerData.java 29488fc 
>   giraph-core/src/main/java/org/apache/giraph/conf/GiraphConstants.java 
> 0424a47 
>   giraph-core/src/main/java/org/apache/giraph/graph/GraphTaskManager.java 
> 684f4eb 
>   
> giraph-core/src/main/java/org/apache/giraph/job/DefaultGiraphJobRetryChecker.java
>  0cab86c 
>   giraph-core/src/main/java/org/apache/giraph/job/GiraphJob.java 4a1f02e 
>   giraph-core/src/main/java/org/apache/giraph/job/GiraphJobRetryChecker.java 
> 53a800e 
>   giraph-core/src/main/java/org/apache/giraph/job/HadoopUtils.java 9530fd6 
>   giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java 
> e129390 
>   giraph-core/src/main/java/org/apache/giraph/master/MasterThread.java 
> 0635210 
>   giraph-core/src/main/java/org/apache/giraph/utils/WritableUtils.java 
> 3f8382e 
>   giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java 
> d2d24ee 
>   giraph-core/src/test/java/org/apache/giraph/utils/TestWritableUtils.java 
> PRE-CREATION 
>   giraph-examples/src/test/java/org/apache/giraph/TestCheckpointing.java 
> 2939af7 
>   pom.xml de25499 
> 
> Diff: https://reviews.apache.org/r/23989/diff/
> 
> 
> Testing
> -------
> 
> Ran PageRank; will keep testing with different jobs.
> 
> 
> Thanks,
> 
> Sergey Edunov
> 
>

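Regarding change 1 in the quoted description (a list of writables serialized 
together with class information), the general technique can be sketched as 
below. This is not the code from the patch: SimpleWritable is a hypothetical 
stand-in for a Writable-style interface so the example stays self-contained, 
and the wire layout (count, then class name plus payload per element) is an 
assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaggedListSketch {
    /** Hypothetical stand-in for a Writable-style interface. */
    public interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    public static class IntMsg implements SimpleWritable {
        public int value;
        public IntMsg() { }
        public IntMsg(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    public static class TextMsg implements SimpleWritable {
        public String text = "";
        public TextMsg() { }
        public TextMsg(String text) { this.text = text; }
        public void write(DataOutput out) throws IOException { out.writeUTF(text); }
        public void readFields(DataInput in) throws IOException { text = in.readUTF(); }
    }

    /** Writes the element count, then each element as (class name, payload). */
    public static byte[] writeList(List<? extends SimpleWritable> items) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(items.size());
        for (SimpleWritable item : items) {
            out.writeUTF(item.getClass().getName()); // class info lets the reader pick the type
            item.write(out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the list back, instantiating each element from its recorded class name. */
    public static List<SimpleWritable> readList(byte[] data)
            throws IOException, ReflectiveOperationException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int size = in.readInt();
        List<SimpleWritable> items = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            Class<?> clazz = Class.forName(in.readUTF());
            SimpleWritable item = (SimpleWritable) clazz.getDeclaredConstructor().newInstance();
            item.readFields(in);
            items.add(item);
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        List<SimpleWritable> original = Arrays.asList(new IntMsg(7), new TextMsg("hi"));
        List<SimpleWritable> restored = readList(writeList(original));
        System.out.println(restored.get(0).getClass().getSimpleName()
            + " " + restored.get(1).getClass().getSimpleName());
    }
}
```

Without the class name, a reader of a mixed-type list would have no way to know 
which readFields implementation to call for each element.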