[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-08-28 Thread Nic Eggert (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144089#comment-16144089
 ] 

Nic Eggert commented on GIRAPH-1139:


Not an expert, but that build failure looks unrelated.

> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144084#comment-16144084
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user edunov commented on the issue:

https://github.com/apache/giraph/pull/30
  
I've committed this one. Let me know if there are any issues. 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144079#comment-16144079
 ] 

Hudson commented on GIRAPH-1139:


FAILURE: Integrated in Jenkins build Giraph-trunk-Commit #1708 (See 
[https://builds.apache.org/job/Giraph-trunk-Commit/1708/])
GIRAPH-1139 (edunov: 
[http://git-wip-us.apache.org/repos/asf?p=giraph.git&a=commit&h=cc489350eba8db8bc62a4fca77b7ad9aa14bcf3d])
* (edit) giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java
* (edit) 
giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java
* (edit) 
giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144080#comment-16144080
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/30


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-08-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140485#comment-16140485
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user edunov commented on the issue:

https://github.com/apache/giraph/pull/30
  
LGTM! Thanks @neggert 
@majakabiljo  do you want to land it or should I? 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-08-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138529#comment-16138529
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
Ping @edunov 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049382#comment-16049382
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
Fixed @majakabiljo's comments. @edunov 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049379#comment-16049379
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r122006839
  
--- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 
---
@@ -272,7 +273,7 @@ public BspService(
 }
 if (LOG.isInfoEnabled()) {
   LOG.info("BspService: Connecting to ZooKeeper with job " + jobId +
-  ", " + getTaskPartition() + " on " + serverPortList);
+  ", " + getTaskId() + " on " + serverPortList);
--- End diff --

We can't actually set `taskId` before creating the ZK client, since we need 
ZK to get the get the application attempt. I changed this to log the partition 
instead, which we can get from `conf`.


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033370#comment-16033370
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r119682732
  
--- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 
---
@@ -288,6 +289,9 @@ public BspService(
   throw new RuntimeException(e);
 }
 
+this.taskId = (int)getApplicationAttempt() * conf.getMaxWorkers() + 
conf.getTaskPartition();
--- End diff --

Please fix checkstyle (make sure 'mvn verify' passes)


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033371#comment-16033371
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r119682567
  
--- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 
---
@@ -272,7 +273,7 @@ public BspService(
 }
 if (LOG.isInfoEnabled()) {
   LOG.info("BspService: Connecting to ZooKeeper with job " + jobId +
-  ", " + getTaskPartition() + " on " + serverPortList);
+  ", " + getTaskId() + " on " + serverPortList);
--- End diff --

We need to set taskId before this


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-05-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031358#comment-16031358
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
ping @edunov @majakabiljo 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981327#comment-15981327
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
This is ready for another look. I've replaced partition id with task id. 
The only place that actually needs a partition id is logging missing workers in 
`BspServiceMaster`. In that case, it's easy enough to recover the partition id 
by taking the task id modulo the number of workers.


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980471#comment-15980471
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
There actually is [one instance][1] where I think it's okay to use 
`getHostnamePartitionId`. This happens in `BspServiceMaster.becomeMaster` when 
creating a master bid in ZK, before any `TaskInfo` instance is created.

This does make me realize, though, that I need to make the same change to 
how the task id is set in `BspServiceMaster`.

What about just changing how `taskPartition` is set in `BspService`, like 
so?

this.taskPartition = (int)getApplicationAttempt() * 
conf.getMaxWorkers() + getTaskPartition();

I think this is actually the minimal code change to fix the issue. I don't 
see anywhere in the code that actually cares about the task partition as 
anything other than a unique identifier.

[1]: 
https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java#L805


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979370#comment-15979370
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user majakabiljo commented on the issue:

https://github.com/apache/giraph/pull/30
  
Can we get rid of getHostnamePartitionId() to avoid incorrectly using it in 
the future? I see various other places where taskPartition is used for 
identifier, do any of them need to be updated too?


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971326#comment-15971326
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r111771953
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java ---
@@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long 
superstep) {
 if (checkpointFrequency == 0) {
   return CheckpointStatus.NONE;
 }
-long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency;
+long firstCheckpoint = INPUT_SUPERSTEP + 1;
--- End diff --

This will actually checkpoint before running superstep 0. You're basically 
checkpointing the input loading work.


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971315#comment-15971315
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
I'm finding that roughly half of my ~2 hour job time is spent just loading 
the graph. Resuming from checkpoint lets me skip that step if something fails. 
This is also why I want to make sure that Giraph checkpoints before starting 
superstep 0. (It used to work this way, and it's documented in the Giraph book.)


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971306#comment-15971306
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r111767933
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java ---
@@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long 
superstep) {
 if (checkpointFrequency == 0) {
   return CheckpointStatus.NONE;
 }
-long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency;
+long firstCheckpoint = INPUT_SUPERSTEP + 1;
--- End diff --

What is the reason for changing this? Do you want it to always do 
checkpoint after the first superstep?


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969747#comment-15969747
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user edunov commented on the issue:

https://github.com/apache/giraph/pull/30
  
Hi Nic, thank you for fixing this and sorry for the delay. 
I'll take a look at this diff other the weekend


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963189#comment-15963189
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
Could someone please take a look at this? @majakabiljo @edunov 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943980#comment-15943980
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


GitHub user neggert opened a pull request:

https://github.com/apache/giraph/pull/30

[GIRAPH-1139] Fix resuming from checkpoint

A couple of fixes that get resuming from checkpoint working.

* Set checkpointStatus to NONE in master when restarting from checkpoint.

Workers already do this, so the job hangs when restarting from checkpoint
while the master waits for workers to create checkpoints they're never
going to create.

* Set unique task id for each worker attempt

Previously, a worker would reuse the task id from the prior attempt. This
gets propagated to the Netty client id, which makes the master think it has
already processed any requests that come from that client, causing it to
discard them. This obviously causes problems.

And also a fix for GIRAPH-1136. We will now checkpoint on superstep 0 if 
checkpointing is enabled. Let me know if you'd rather I sent a separate PR for 
this.

Testing:
Ran custom Label Propagation implementation with checkpointing on a ~5b 
node graph. Manually killed workers (by logging in to worker node and running 
`kill -9 `. Ensured that Giraph successfully resumed from most recent 
checkpoint.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/neggert/giraph trunk_resume_fixes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #30


commit 6462d48e46d84cd6aa5ecd5817b0d057ce3a6c1f
Author: NicEggert 
Date:   2017-03-23T20:23:03Z

Set checkpointStatus to NONE in master when restarting from checkpoint.

Workers already do this, so the job hangs when restarting from checkpoint
while the master waits for workers to create checkpoints they're never
going to create.

commit 3ed8c18a3bc97c910e364bf7d48d50be25df704c
Author: NicEggert 
Date:   2017-03-23T20:26:02Z

Checkpoint on superstep 0 if checkpointing is enabled

commit 74bba4573dbb77242d81352f84969b114db1cb71
Author: NicEggert 
Date:   2017-03-23T20:26:47Z

Set unique task id for each worker attempt

Previously, a worker would reuse the task id from the prior attempt. This
gets propagated to the Netty client id, which makes the master think it has
already processed any requests that come from that client, causing it to
discard them. This obviously causes problems.




> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)