Good evening,

I'm not quite sure if this is a bug in my code or in Twill. I've been working away on TWILL-78, but I'm running into a basic problem: I don't seem to receive all of the log messages from my containers. You can find a copy of my toy application here: https://gist.github.com/erickt/7b16d695b64384015b41.

I've been testing Twill with 3 containers. Occasionally I get everything I expect, but sometimes the driver only shows a subset of the log messages written on the worker nodes. Here's an example. The container logs in /var/log/hadoop-yarn/container/... on the nodes have:
node1: ...
Launching main: public static void org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) throws java.lang.Exception []
2014-05-10 00:20:09,334 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - entering barrier
2014-05-10 00:20:09,352 - WARN [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are no ConnectionStateListeners registered.
2014-05-10 00:20:11,187 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in barrier
2014-05-10 00:20:11,188 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
2014-05-10 00:20:11,830 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of barrier
2014-05-10 00:20:11,831 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
Main class completed.
Launcher completed
Cleanup directory tmp/twill.launcher-1399681206035-0
----
node2: ...
Launching main: public static void org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) throws java.lang.Exception []
2014-05-10 00:20:00,133 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - entering barrier
2014-05-10 00:20:00,158 - WARN [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are no ConnectionStateListeners registered.
2014-05-10 00:20:02,161 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in barrier
2014-05-10 00:20:02,161 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
2014-05-10 00:20:02,979 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of barrier
2014-05-10 00:20:02,979 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
Main class completed.
Launcher completed
Cleanup directory tmp/twill.launcher-1399681196232-0
----
node3: ...
Launching main: public static void org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) throws java.lang.Exception []
2014-05-10 00:20:01,191 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - entering barrier
2014-05-10 00:20:01,197 - WARN [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are no ConnectionStateListeners registered.
2014-05-10 00:20:02,768 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in barrier
2014-05-10 00:20:02,769 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
2014-05-10 00:20:03,587 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of barrier
2014-05-10 00:20:03,588 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
Main class completed.
Launcher completed
Cleanup directory tmp/twill.launcher-1399681197698-0
----

But the driver script in this case only shows the output from 2 nodes:

----
2014-05-09 17:19:15,137 - WARN [main:o.a.h.u.NativeCodeLoader@62] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] - before getting completion
2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) - entering barrier
2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) - entering barrier
2014-05-10T00:20:09,352Z WARN o.a.c.f.s.ConnectionStateManager [node2] [ConnectionStateManager-0] ConnectionStateManager:processEvents(ConnectionStateManager.java:212) - There are no ConnectionStateListeners registered.
2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in barrier
2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) - woken up
2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out of barrier
2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
2014-05-10T00:20:00,158Z WARN o.a.c.f.s.ConnectionStateManager [node1] [ConnectionStateManager-0] ConnectionStateManager:processEvents(ConnectionStateManager.java:212) - There are no ConnectionStateListeners registered.
2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in barrier
2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) - woken up
2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out of barrier
2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket error: Connection reset by peer
2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] - after shutting down
2014-05-09 17:20:13,312 - ERROR [Thread-3:o.l.g.t.GraphlabApplication$1@130] - shutting down
----

So is this something I'm doing wrong? Or is Twill or Kafka somehow shutting down before all the messages have been sent?

Thanks for any help,
-Erick
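
P.S. In case it helps anyone reproduce this, here is a minimal sketch of how the driver output could be tallied per node to spot a container whose logs never arrived. It is not part of the toy application. It assumes the driver output has been captured as a list of text lines, and the names `count_lines_per_host` and `missing_hosts` are made up for illustration. The regex only targets the forwarded-log format shown in the driver output above (ISO timestamp ending in Z, level, logger, then the host in brackets); the driver's own local log lines are deliberately ignored.

```python
import re
from collections import Counter

# Matches forwarded container log lines as they appear in the driver output, e.g.
#   2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] ...
# Group 1 captures the host name inside the first pair of brackets.
FORWARDED_LINE = re.compile(
    r"^\S+Z\s+(?:ERROR|WARN|INFO|DEBUG|TRACE)\s+\S+\s+\[([^\]]+)\]"
)


def count_lines_per_host(lines):
    """Return a Counter mapping host name -> number of forwarded log lines."""
    counts = Counter()
    for line in lines:
        match = FORWARDED_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def missing_hosts(lines, expected_hosts):
    """Return the expected hosts that contributed no forwarded log lines."""
    seen = count_lines_per_host(lines)
    return sorted(host for host in expected_hosts if host not in seen)
```

Fed the driver output above with node1, node2 and node3 as the expected hosts, `missing_hosts` reports node3 as the only host with no forwarded lines, which matches what I see by eye.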
