Our messaging service is reliable? It'd be nice if we add some
validation unit test.

On Sun, Aug 19, 2012 at 12:22 AM, Thomas Jungblut
<[email protected]> wrote:
> And it does not fail in rev 1350261.
>
> 2012/8/18 Thomas Jungblut <[email protected]>
>
>> also happens in local mode. but the testcase had turned it off.
>>
>>
>> 2012/8/18 Thomas Jungblut <[email protected]>
>>
>>> Okay still no idea.
>>>
>>> I know that this happens when the master tasks sends a message with the
>>> updated aggregator values to every slave. (line 239). Then this only
>>> message should be consumed arround line 246 cc.
>>>
>>> But it still remains in the buffer and will be consumed after all
>>> computation in line 209 in the parseMessages.
>>> Even clearing the buffer is not fixing it.
>>>
>>> I should rewrite this whole crapyard to the new superstep API. -.-
>>>
>>>
>>> 2012/8/18 Thomas Jungblut <[email protected]>
>>>
>>>> Okay two possible problems:
>>>> - It changes the host because of fault tolerance (contra arguments: its
>>>> turned off and the port is smaller than the other one)
>>>> - Messaging is broken (would also explain why pagerank does not converge
>>>> anymore)
>>>>
>>>> Can anyone else reproduce this scenario?
>>>>
>>>> 2012/8/18 Apache Jenkins Server <[email protected]>
>>>>
>>>>> See <https://builds.apache.org/job/Hama-Nightly/645/changes>
>>>>>
>>>>> Changes:
>>>>>
>>>>> [tjungblut] add missing class
>>>>>
>>>>> ------------------------------------------
>>>>> [...truncated 4232 lines...]
>>>>>
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>>
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
>>>>>
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
>>>>>
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++
>>>>> VAL=0.4572019638123739
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++
>>>>> VAL=0.44247448197562855
>>>>>
>>>>> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level
>>>>> KeeperException when processing sessionid:0x13936d5d8cc0002 type:create
>>>>> cxid:0x3de zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
>>>>> Path:/bsp/job_201208172305_0001/sync/51 Error:KeeperErrorCode = NodeExists
>>>>> for /bsp/job_201208172305_0001/sync/51
>>>>>
>>>>> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level
>>>>> KeeperException when processing sessionid:0x13936d5d8cc0002 type:create
>>>>> cxid:0x3e8 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
>>>>> Path:/bsp/job_201208172305_0001/sync/51/ready Error:KeeperErrorCode =
>>>>> NodeExists for /bsp/job_201208172305_0001/sync/51/ready
>>>>>
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
>>>>>
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++
>>>>> VAL=0.457610574551534
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true
>>>>> janus.apache.org:61002
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++
>>>>> VAL=0.2675231554874198
>>>>>
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 11
>>>>> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ NULL! false
>>>>> janus.apache.org:61001
>>>>>
>>>>> 12/08/17 23:05:52 ERROR bsp.BSPTask: Error running bsp setup and bsp
>>>>> function.
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
>>>>>         at
>>>>> org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
>>>>>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
>>>>>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
>>>>>         at
>>>>> org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)
>>>>>
>>>>> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Processed session
>>>>> termination for sessionid: 0x13936d5d8cc0002
>>>>>
>>>>> 12/08/17 23:05:52 INFO zookeeper.ZooKeeper: Session: 0x13936d5d8cc0002
>>>>> closed
>>>>>
>>>>> 12/08/17 23:05:52 INFO zookeeper.ClientCnxn: EventThread shut down
>>>>>
>>>>> 12/08/17 23:05:52 INFO server.NIOServerCnxn: Closed socket connection
>>>>> for client /127.0.0.1:54114 which had sessionid 0x13936d5d8cc0002
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x001cb7a1, /
>>>>> 67.195.138.60:58913 :> /67.195.138.60:61001] DISCONNECTED
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x0136d9d8, /
>>>>> 67.195.138.60:58913 :> janus.apache.org/67.195.138.60:61001]
>>>>> DISCONNECTED
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x0136d9d8, /
>>>>> 67.195.138.60:58913 :> janus.apache.org/67.195.138.60:61001] UNBOUND
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x001cb7a1, /
>>>>> 67.195.138.60:58913 :> /67.195.138.60:61001] UNBOUND
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x0136d9d8, /
>>>>> 67.195.138.60:58913 :> janus.apache.org/67.195.138.60:61001] CLOSED
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x001cb7a1, /
>>>>> 67.195.138.60:58913 :> /67.195.138.60:61001] CLOSED
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Remote peer
>>>>> janus.apache.org/67.195.138.60:61001 closed connection.
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x01bf7b23, /
>>>>> 67.195.138.60:58915 :> /67.195.138.60:61001] DISCONNECTED
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Disconnecting from
>>>>> janus.apache.org/67.195.138.60:61001
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x01bf7b23, /
>>>>> 67.195.138.60:58915 :> /67.195.138.60:61001] UNBOUND
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x002c17f7, /
>>>>> 67.195.138.60:58915 :> janus.apache.org/67.195.138.60:61001]
>>>>> DISCONNECTED
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyServer: [id: 0x01bf7b23, /
>>>>> 67.195.138.60:58915 :> /67.195.138.60:61001] CLOSED
>>>>>
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x002c17f7, /
>>>>> 67.195.138.60:58915 :> janus.apache.org/67.195.138.60:61001] UNBOUND
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: [id: 0x002c17f7, /
>>>>> 67.195.138.60:58915 :> janus.apache.org/67.195.138.60:61001] CLOSED
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Remote peer
>>>>> janus.apache.org/67.195.138.60:61001 closed connection.
>>>>> 12/08/17 23:05:52 INFO ipc.NettyTransceiver: Disconnecting from
>>>>> janus.apache.org/67.195.138.60:61001
>>>>>
>>>>> 12/08/17 23:05:52 ERROR bsp.BSPTask: Shutting down ping service.
>>>>>
>>>>> 12/08/17 23:05:52 FATAL bsp.GroomServer: Error running child
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
>>>>>         at
>>>>> org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
>>>>>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
>>>>>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
>>>>>         at
>>>>> org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)
>>>>>
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
>>>>>         at
>>>>> org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
>>>>>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
>>>>>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
>>>>>         at
>>>>> org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)
>>>>> 12/08/17 23:05:53 INFO bsp.BSPJobClient: Current supersteps number: 0
>>>>>
>>>>> 12/08/17 23:05:56 INFO bsp.BSPJobClient: Current supersteps number: 52
>>>>>
>>>>> 12/08/17 23:06:00 INFO bsp.GroomServer: adding purge task:
>>>>> attempt_201208172305_0001_000001_0
>>>>>
>>>>> 12/08/17 23:06:00 INFO bsp.GroomServer: About to purge task:
>>>>> attempt_201208172305_0001_000001_0
>>>>>
>>>>> 12/08/17 23:06:00 INFO ipc.NettyServer: [id: 0x00b6d6ab, /
>>>>> 67.195.138.60:46475 :> /67.195.138.60:61002] DISCONNECTED
>>>>>
>>>>> 12/08/17 23:06:00 INFO ipc.NettyServer: [id: 0x00b6d6ab, /
>>>>> 67.195.138.60:46475 :> /67.195.138.60:61002] UNBOUND
>>>>>
>>>>> 12/08/17 23:06:00 INFO ipc.NettyServer: [id: 0x00b6d6ab, /
>>>>> 67.195.138.60:46475 :> /67.195.138.60:61002] CLOSED
>>>>>
>>>>> 12/08/17 23:06:01 INFO bsp.JobInProgress: Taskid
>>>>> 'attempt_201208172305_0001_000001_0' has failed.
>>>>>
>>>>> 12/08/17 23:06:01 INFO bsp.TaskInProgress: Task
>>>>> 'task_201208172305_0001_000001' has failed.
>>>>>
>>>>> 12/08/17 23:06:01 INFO bsp.JobInProgress: Job failed.
>>>>>
>>>>> 12/08/17 23:06:01 INFO bsp.GroomServer: Kill 1 tasks.
>>>>>
>>>>> 12/08/17 23:06:01 INFO bsp.GroomServer: Kill 1 tasks.
>>>>>
>>>>> 12/08/17 23:06:01 WARN server.NIOServerCnxn: EndOfStreamException:
>>>>> Unable to read additional data from client sessionid 0x13936d5d8cc0003,
>>>>> likely client has closed socket
>>>>>
>>>>> 12/08/17 23:06:01 INFO server.NIOServerCnxn: Closed socket connection
>>>>> for client /0:0:0:0:0:0:0:1:37318 which had sessionid 0x13936d5d8cc0003
>>>>>
>>>>> 12/08/17 23:06:02 WARN mortbay.log: /tasklog: java.io.IOException:
>>>>> Closed
>>>>>
>>>>> 12/08/17 23:06:02 WARN bsp.BSPJobClient: Error reading task
>>>>> outputhttp://
>>>>> janus.apache.org:40015/tasklog?plaintext=true&taskid=attempt_201208172305_0001_000001_0&filter=stdout
>>>>>
>>>>> 12/08/17 23:06:02 INFO bsp.BSPJobClient: Job failed.
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.PrepRequestProcessor: Processed session
>>>>> termination for sessionid: 0x13936d5d8cc0000
>>>>>
>>>>> 12/08/17 23:06:02 INFO zookeeper.ZooKeeper: Session: 0x13936d5d8cc0000
>>>>> closed
>>>>>
>>>>> 12/08/17 23:06:02 INFO zookeeper.ClientCnxn: EventThread shut down
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping server on 40000
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.NIOServerCnxn: Closed socket connection
>>>>> for client /127.0.0.1:54084 which had sessionid 0x13936d5d8cc0000
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 0 on 40000:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server listener on 40000
>>>>>
>>>>> 12/08/17 23:06:02 INFO metrics.RpcInstrumentation: shut down
>>>>>
>>>>> 12/08/17 23:06:02 INFO bsp.BSPMaster: Stopped RPC Master server.
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server Responder
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.PrepRequestProcessor: Processed session
>>>>> termination for sessionid: 0x13936d5d8cc0001
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.NIOServerCnxn: Closed socket connection
>>>>> for client /127.0.0.1:54088 which had sessionid 0x13936d5d8cc0001
>>>>>
>>>>> 12/08/17 23:06:02 INFO zookeeper.ClientCnxn: EventThread shut down
>>>>>
>>>>> 12/08/17 23:06:02 INFO zookeeper.ZooKeeper: Session: 0x13936d5d8cc0001
>>>>> closed
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping server on 40440
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server listener on 40440
>>>>>
>>>>> 12/08/17 23:06:02 INFO metrics.RpcInstrumentation: shut down
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server Responder
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping server on 48980
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 0 on 40440:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 2 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 1 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 0 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server Responder
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: Stopping IPC Server listener on 48980
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 5 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO metrics.RpcInstrumentation: shut down
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 7 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 3 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 6 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 9 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 4 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.NIOServerCnxn: NIOServerCnxn factory
>>>>> exited run method
>>>>>
>>>>> 12/08/17 23:06:02 INFO ipc.Server: IPC Server handler 8 on 48980:
>>>>> exiting
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.PrepRequestProcessor:
>>>>> PrepRequestProcessor exited loop!
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.SyncRequestProcessor:
>>>>> SyncRequestProcessor exited!
>>>>>
>>>>> 12/08/17 23:06:02 INFO server.FinalRequestProcessor: shutdown of
>>>>> request processor complete
>>>>>
>>>>> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.028
>>>>> sec <<< FAILURE!
>>>>>
>>>>> Results :
>>>>>
>>>>> Failed tests:
>>>>>   testSubmitJob(org.apache.hama.graph.TestSubmitGraphJob)
>>>>>
>>>>> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
>>>>>
>>>>> [INFO]
>>>>> ------------------------------------------------------------------------
>>>>> [INFO] Reactor Summary:
>>>>> [INFO]
>>>>> [INFO] Apache Hama parent POM ............................ SUCCESS
>>>>> [5.294s]
>>>>> [INFO] core .............................................. SUCCESS
>>>>> [1:49.431s]
>>>>> [INFO] graph ............................................. FAILURE
>>>>> [17.399s]
>>>>> [INFO] examples .......................................... SKIPPED
>>>>> [INFO] yarn .............................................. SKIPPED
>>>>> [INFO] machine learning .................................. SKIPPED
>>>>> [INFO] hama-dist ......................................... SKIPPED
>>>>> [INFO]
>>>>> ------------------------------------------------------------------------
>>>>> [INFO] BUILD FAILURE
>>>>> [INFO]
>>>>> ------------------------------------------------------------------------
>>>>> [INFO] Total time: 2:13.208s
>>>>> [INFO] Finished at: Fri Aug 17 23:06:02 UTC 2012
>>>>> [INFO] Final Memory: 40M/191M
>>>>> [INFO]
>>>>> ------------------------------------------------------------------------
>>>>> [ERROR] Failed to execute goal
>>>>> org.apache.maven.plugins:maven-surefire-plugin:2.6:test (default-test) on
>>>>> project hama-graph: There are test failures.
>>>>> [ERROR]
>>>>> [ERROR] Please refer to <
>>>>> https://builds.apache.org/job/Hama-Nightly/ws/trunk/graph/target/surefire-reports>
>>>>> for the individual test results.
>>>>> [ERROR] -> [Help 1]
>>>>> [ERROR]
>>>>> [ERROR] To see the full stack trace of the errors, re-run Maven with
>>>>> the -e switch.
>>>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>>>> [ERROR]
>>>>> [ERROR] For more information about the errors and possible solutions,
>>>>> please read the following articles:
>>>>> [ERROR] [Help 1]
>>>>> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>>>>> [ERROR]
>>>>> [ERROR] After correcting the problems, you can resume the build with
>>>>> the command
>>>>> [ERROR]   mvn <goals> -rf :hama-graph
>>>>> Build step 'Invoke top-level Maven targets' marked build as failure
>>>>> Archiving artifacts
>>>>> Recording test results
>>>>>
>>>>
>>>>
>>>
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Reply via email to