Our messaging service is reliable? It'd be nice if we add some validation unit test.
On Sun, Aug 19, 2012 at 12:22 AM, Thomas Jungblut <[email protected]> wrote: > And it does not fail in rev 1350261. > > 2012/8/18 Thomas Jungblut <[email protected]> > >> also happens in local mode. but the testcase had turned it off. >> >> >> 2012/8/18 Thomas Jungblut <[email protected]> >> >>> Okay still no idea. >>> >>> I know that this happens when the master tasks sends a message with the >>> updated aggregator values to every slave. (line 239). Then this only >>> message should be consumed arround line 246 cc. >>> >>> But it still remains in the buffer and will be consumed after all >>> computation in line 209 in the parseMessages. >>> Even clearing the buffer is not fixing it. >>> >>> I should rewrite this whole crapyard to the new superstep API. -.- >>> >>> >>> 2012/8/18 Thomas Jungblut <[email protected]> >>> >>>> Okay two possible problems: >>>> - It changes the host because of fault tolerance (contra arguments: its >>>> turned off and the port is smaller than the other one) >>>> - Messaging is broken (would also explain why pagerank does not converge >>>> anymore) >>>> >>>> Can anyone else reproduce this scenario? >>>> >>>> 2012/8/18 Apache Jenkins Server <[email protected]> >>>> >>>>> See <https://builds.apache.org/job/Hama-Nightly/645/changes> >>>>> >>>>> Changes: >>>>> >>>>> [tjungblut] add missing class >>>>> >>>>> ------------------------------------------ >>>>> [...truncated 4232 lines...] >>>>> >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0 >>>>> >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4 >>>>> >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ >>>>> VAL=0.4572019638123739 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ >>>>> VAL=0.44247448197562855 >>>>> >>>>> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level >>>>> KeeperException when processing sessionid:0x13936d5d8cc0002 type:create >>>>> cxid:0x3de zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error >>>>> Path:/bsp/job_201208172305_0001/sync/51 Error:KeeperErrorCode = NodeExists >>>>> for /bsp/job_201208172305_0001/sync/51 >>>>> >>>>> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level >>>>> KeeperException when processing sessionid:0x13936d5d8cc0002 type:create >>>>> cxid:0x3e8 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error >>>>> Path:/bsp/job_201208172305_0001/sync/51/ready Error:KeeperErrorCode = >>>>> NodeExists for /bsp/job_201208172305_0001/sync/51/ready >>>>> >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0 >>>>> >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ >>>>> VAL=0.457610574551534 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true >>>>> janus.apache.org:61002 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ >>>>> VAL=0.2675231554874198 >>>>> >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 11 >>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ NULL! false >>>>> janus.apache.org:61001 >>>>> >>>>> 12/08/17 23:05:52 ERROR bsp.BSPTask: Error running bsp setup and bsp >>>>> function. >>>>> java.lang.NullPointerException >>>>> at >>>>> org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373) >>>>> at >>>>> org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209) >>>>> at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166) >>>>> at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143) >>>>> at >>>>> org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271) >>>>> >>>>> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Processed session >>>>> termination for sessionid: 0x13936d5d8cc0002 >>>>> >>>>> 12/08/17 23:05:52 INFO zookeeper.ZooKeeper: Session: 0x13936d5d8cc0002 >>>>> closed >>>>> >>>>> 12/08/17 23:05:52 INFO zookeeper.ClientCnxn: EventThread shut down >>>>> >>>>> 12/08/17 23:05:52 INFO server.NIOServerCnxn: Closed socket connection >>>>> for client /127.0.0.1:54114 which had sessionid 0x13936d5d8cc0002 >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x001cb7a1, / >>>>> 67.195.138.60:58913 :> /67.195.138.60:61001] DISCONNECTED >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x0136d9d8, / >>>>> 67.195.138.60:58913 :> janus.apache.org/67.195.138.60:61001] >>>>> DISCONNECTED >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x0136d9d8, / >>>>> 67.195.138.60:58913 :> janus.apache.org/67.195.138.60:61001] UNBOUND >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x001cb7a1, / >>>>> 67.195.138.60:58913 :> /67.195.138.60:61001] UNBOUND >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x0136d9d8, / >>>>> 67.195.138.60:58913 :> janus.apache.org/67.195.138.60:61001] CLOSED >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x001cb7a1, / >>>>> 67.195.138.60:58913 :> /67.195.138.60:61001] CLOSED >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Remote peer >>>>> janus.apache.org/67.195.138.60:61001 closed connection. >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x01bf7b23, / >>>>> 67.195.138.60:58915 :> /67.195.138.60:61001] DISCONNECTED >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Disconnecting from >>>>> janus.apache.org/67.195.138.60:61001 >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x01bf7b23, / >>>>> 67.195.138.60:58915 :> /67.195.138.60:61001] UNBOUND >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x002c17f7, / >>>>> 67.195.138.60:58915 :> janus.apache.org/67.195.138.60:61001] >>>>> DISCONNECTED >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x01bf7b23, / >>>>> 67.195.138.60:58915 :> /67.195.138.60:61001] CLOSED >>>>> >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x002c17f7, / >>>>> 67.195.138.60:58915 :> janus.apache.org/67.195.138.60:61001] UNBOUND >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x002c17f7, / >>>>> 67.195.138.60:58915 :> janus.apache.org/67.195.138.60:61001] CLOSED >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Remote peer >>>>> janus.apache.org/67.195.138.60:61001 closed connection. >>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Disconnecting from >>>>> janus.apache.org/67.195.138.60:61001 >>>>> >>>>> 12/08/17 23:05:52 ERROR bsp.BSPTask: Shutting down ping service. >>>>> >>>>> 12/08/17 23:05:52 FATAL bsp.GroomServer: Error running child >>>>> java.lang.NullPointerException >>>>> at >>>>> org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373) >>>>> at >>>>> org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209) >>>>> at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166) >>>>> at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143) >>>>> at >>>>> org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271) >>>>> >>>>> java.lang.NullPointerException >>>>> at >>>>> org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373) >>>>> at >>>>> org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209) >>>>> at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166) >>>>> at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143) >>>>> at >>>>> org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271) >>>>> 12/08/17 23:05:53 INFO bsp.BSPJobClient: Current supersteps number: 0 >>>>> >>>>> 12/08/17 23:05:56 INFO bsp.BSPJobClient: Current supersteps number: 52 >>>>> >>>>> 12/08/17 23:06:00 INFO bsp.GroomServer: adding purge task: >>>>> attempt_201208172305_0001_000001_0 >>>>> >>>>> 12/08/17 23:06:00 INFO bsp.GroomServer: About to purge task: >>>>> attempt_201208172305_0001_000001_0 >>>>> >>>>> 12/08/17 23:06:00 INFO ipc.NettyServer: [id: 0x00b6d6ab, / >>>>> 67.195.138.60:46475 :> /67.195.138.60:61002] DISCONNECTED >>>>> >>>>> 12/08/17 23:06:00 INFO ipc.NettyServer: [id: 0x00b6d6ab, / >>>>> 67.195.138.60:46475 :> /67.195.138.60:61002] UNBOUND >>>>> >>>>> 12/08/17 23:06:00 INFO ipc.NettyServer: [id: 0x00b6d6ab, / >>>>> 67.195.138.60:46475 :> /67.195.138.60:61002] CLOSED >>>>> >>>>> 12/08/17 23:06:01 INFO bsp.JobInProgress: Taskid >>>>> 'attempt_201208172305_0001_000001_0' has failed. >>>>> >>>>> 12/08/17 23:06:01 INFO bsp.TaskInProgress: Task >>>>> 'task_201208172305_0001_000001' has failed. >>>>> >>>>> 12/08/17 23:06:01 INFO bsp.JobInProgress: Job failed. >>>>> >>>>> 12/08/17 23:06:01 INFO bsp.GroomServer: Kill 1 tasks. >>>>> >>>>> 12/08/17 23:06:01 INFO bsp.GroomServer: Kill 1 tasks. >>>>> >>>>> 12/08/17 23:06:01 WARN server.NIOServerCnxn: EndOfStreamException: >>>>> Unable to read additional data from client sessionid 0x13936d5d8cc0003, >>>>> likely client has closed socket >>>>> >>>>> 12/08/17 23:06:01 INFO server.NIOServerCnxn: Closed socket connection >>>>> for client /0:0:0:0:0:0:0:1:37318 which had sessionid 0x13936d5d8cc0003 >>>>> >>>>> 12/08/17 23:06:02 WARN mortbay.log: /tasklog: java.io.IOException: >>>>> Closed >>>>> >>>>> 12/08/17 23:06:02 WARN bsp.BSPJobClient: Error reading task >>>>> outputhttp:// >>>>> janus.apache.org:40015/tasklog?plaintext=true&taskid=attempt_201208172305_0001_000001_0&filter=stdout >>>>> >>>>> 12/08/17 23:06:02 INFO bsp.BSPJobClient: Job failed. >>>>> >>>>> 12/08/17 23:06:02 INFO server.PrepRequestProcessor: Processed session >>>>> termination for sessionid: 0x13936d5d8cc0000 >>>>> >>>>> 12/08/17 23:06:02 INFO zookeeper.ZooKeeper: Session: 0x13936d5d8cc0000 >>>>> closed >>>>> >>>>> 12/08/17 23:06:02 INFO zookeeper.ClientCnxn: EventThread shut down >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping server on 40000 >>>>> >>>>> 12/08/17 23:06:02 INFO server.NIOServerCnxn: Closed socket connection >>>>> for client /127.0.0.1:54084 which had sessionid 0x13936d5d8cc0000 >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 0 on 40000: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server listener on 40000 >>>>> >>>>> 12/08/17 23:06:02 INFO metrics.RpcInstrumentation: shut down >>>>> >>>>> 12/08/17 23:06:02 INFO bsp.BSPMaster: Stopped RPC Master server. >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server Responder >>>>> >>>>> 12/08/17 23:06:02 INFO server.PrepRequestProcessor: Processed session >>>>> termination for sessionid: 0x13936d5d8cc0001 >>>>> >>>>> 12/08/17 23:06:02 INFO server.NIOServerCnxn: Closed socket connection >>>>> for client /127.0.0.1:54088 which had sessionid 0x13936d5d8cc0001 >>>>> >>>>> 12/08/17 23:06:02 INFO zookeeper.ClientCnxn: EventThread shut down >>>>> >>>>> 12/08/17 23:06:02 INFO zookeeper.ZooKeeper: Session: 0x13936d5d8cc0001 >>>>> closed >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping server on 40440 >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server listener on 40440 >>>>> >>>>> 12/08/17 23:06:02 INFO metrics.RpcInstrumentation: shut down >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server Responder >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping server on 48980 >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 0 on 40440: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 2 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 1 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 0 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server Responder >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server listener on 48980 >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 5 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO metrics.RpcInstrumentation: shut down >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 7 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 3 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 6 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 9 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 4 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO server.NIOServerCnxn: NIOServerCnxn factory >>>>> exited run method >>>>> >>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 8 on 48980: >>>>> exiting >>>>> >>>>> 12/08/17 23:06:02 INFO server.PrepRequestProcessor: >>>>> PrepRequestProcessor exited loop! >>>>> >>>>> 12/08/17 23:06:02 INFO server.SyncRequestProcessor: >>>>> SyncRequestProcessor exited! >>>>> >>>>> 12/08/17 23:06:02 INFO server.FinalRequestProcessor: shutdown of >>>>> request processor complete >>>>> >>>>> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.028 >>>>> sec <<< FAILURE! >>>>> >>>>> Results : >>>>> >>>>> Failed tests: >>>>> testSubmitJob(org.apache.hama.graph.TestSubmitGraphJob) >>>>> >>>>> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0 >>>>> >>>>> [INFO] >>>>> ------------------------------------------------------------------------ >>>>> [INFO] Reactor Summary: >>>>> [INFO] >>>>> [INFO] Apache Hama parent POM ............................ SUCCESS >>>>> [5.294s] >>>>> [INFO] core .............................................. SUCCESS >>>>> [1:49.431s] >>>>> [INFO] graph ............................................. FAILURE >>>>> [17.399s] >>>>> [INFO] examples .......................................... SKIPPED >>>>> [INFO] yarn .............................................. SKIPPED >>>>> [INFO] machine learning .................................. SKIPPED >>>>> [INFO] hama-dist ......................................... SKIPPED >>>>> [INFO] >>>>> ------------------------------------------------------------------------ >>>>> [INFO] BUILD FAILURE >>>>> [INFO] >>>>> ------------------------------------------------------------------------ >>>>> [INFO] Total time: 2:13.208s >>>>> [INFO] Finished at: Fri Aug 17 23:06:02 UTC 2012 >>>>> [INFO] Final Memory: 40M/191M >>>>> [INFO] >>>>> ------------------------------------------------------------------------ >>>>> [ERROR] Failed to execute goal >>>>> org.apache.maven.plugins:maven-surefire-plugin:2.6:test (default-test) on >>>>> project hama-graph: There are test failures. >>>>> [ERROR] >>>>> [ERROR] Please refer to < >>>>> https://builds.apache.org/job/Hama-Nightly/ws/trunk/graph/target/surefire-reports> >>>>> for the individual test results. >>>>> [ERROR] -> [Help 1] >>>>> [ERROR] >>>>> [ERROR] To see the full stack trace of the errors, re-run Maven with >>>>> the -e switch. >>>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging. >>>>> [ERROR] >>>>> [ERROR] For more information about the errors and possible solutions, >>>>> please read the following articles: >>>>> [ERROR] [Help 1] >>>>> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException >>>>> [ERROR] >>>>> [ERROR] After correcting the problems, you can resume the build with >>>>> the command >>>>> [ERROR] mvn <goals> -rf :hama-graph >>>>> Build step 'Invoke top-level Maven targets' marked build as failure >>>>> Archiving artifacts >>>>> Recording test results >>>>> >>>> >>>> >>> >> -- Best Regards, Edward J. Yoon @eddieyoon
