-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23989/#review49163
-----------------------------------------------------------


Did you test automatic restart of jobs after checkpoint? 

Maybe I'm missing something, because it looks to me that after checkpoint we 
finish as if halt was called, meaning we store output. Is this the case?


giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java
<https://reviews.apache.org/r/23989/#comment86003>

    Why do we need two separate flags here?



giraph-core/src/main/java/org/apache/giraph/conf/GiraphConstants.java
<https://reviews.apache.org/r/23989/#comment86004>

    What is this?



giraph-core/src/main/java/org/apache/giraph/job/GiraphJob.java
<https://reviews.apache.org/r/23989/#comment86005>

    Why don't we just use latest checkpoint?



giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java
<https://reviews.apache.org/r/23989/#comment86006>

    Why do we ignore death here?


- Maja Kabiljo


On July 28, 2014, 5:25 p.m., Sergey Edunov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23989/
> -----------------------------------------------------------
> 
> (Updated July 28, 2014, 5:25 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Repository: giraph-git
> 
> 
> Description
> -------
> 
> We need to address some issues with checkpointing:
> 1) worker2worker messages are not saved
> 2) BspServiceWorker does not compile under hadoop_0.23 profile
> 3) it would be nice to be able to manually checkpoint and stop any job at any 
> point of time.
> 
> Changes:
> 
> 1) worker2worker messages fixed my serializing currentworkertoworker messages 
> (it is a list of writable so I had to write class information as well)
> 2) Compilation issues fixed
> 3) The way you can trigger checkpointing now is by creating 
> /_checkpointAndStop node in zookeeper (same way as _haltComputation works) 
> After that the behavior of the job will be determined by registered 
> GiraphJobRetryChecker. By default, job will get checkpointed at the end of 
> current superstep and halted. You can override this behavior by making 
> shouldRestartCheckpoint() return true, in this case job will be restarted 
> immediately after getting checkpointed.
> 
> 
> Diffs
> -----
> 
>   giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 02577b9 
>   giraph-core/src/main/java/org/apache/giraph/bsp/CentralizedService.java 
> ff3e427 
>   
> giraph-core/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java 
> e5b7cf3 
>   giraph-core/src/main/java/org/apache/giraph/bsp/CheckpointStatus.java 
> PRE-CREATION 
>   giraph-core/src/main/java/org/apache/giraph/bsp/SuperstepState.java c384fbf 
>   giraph-core/src/main/java/org/apache/giraph/comm/ServerData.java 29488fc 
>   giraph-core/src/main/java/org/apache/giraph/conf/GiraphConstants.java 
> 0424a47 
>   giraph-core/src/main/java/org/apache/giraph/graph/GraphTaskManager.java 
> 684f4eb 
>   
> giraph-core/src/main/java/org/apache/giraph/job/DefaultGiraphJobRetryChecker.java
>  0cab86c 
>   giraph-core/src/main/java/org/apache/giraph/job/GiraphJob.java 4a1f02e 
>   giraph-core/src/main/java/org/apache/giraph/job/GiraphJobRetryChecker.java 
> 53a800e 
>   giraph-core/src/main/java/org/apache/giraph/job/HadoopUtils.java 9530fd6 
>   giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java 
> e129390 
>   giraph-core/src/main/java/org/apache/giraph/master/MasterThread.java 
> 0635210 
>   giraph-core/src/main/java/org/apache/giraph/utils/WritableUtils.java 
> 3f8382e 
>   giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java 
> d2d24ee 
>   giraph-core/src/test/java/org/apache/giraph/utils/TestWritableUtils.java 
> PRE-CREATION 
>   giraph-examples/src/test/java/org/apache/giraph/TestCheckpointing.java 
> 2939af7 
>   pom.xml de25499 
> 
> Diff: https://reviews.apache.org/r/23989/diff/
> 
> 
> Testing
> -------
> 
> Run pagerank, will keep testing with different jobs.
> 
> 
> Thanks,
> 
> Sergey Edunov
> 
>

Reply via email to