Hello Terence,

Thanks for getting back to me. It does seem to be a timing issue. I've updated my gist (https://gist.github.com/erickt/7b16d695b64384015b41) to add some extra controls and a driver script. With a sleep of `0`, it sometimes works and sometimes exits before all the `"done"` messages have been sent. With a sleep of `10`, however, I seem to get all the messages.

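For reference, the delay being tested amounts to something like the sketch below. This is only an illustration, not the code from the gist: `DelayedExitRunnable`, `work`, and `graceMillis` are made-up names, and I'm assuming a real Twill runnable would extend `AbstractTwillRunnable` instead of implementing a plain `Runnable`.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the runnable in the gist. It does its work and
// then lingers for a grace period before run() returns, so that any
// asynchronous log shipping has a chance to finish before shutdown begins.
class DelayedExitRunnable implements Runnable {
    private final Runnable work;     // the real job (barrier, logging, ...)
    private final long graceMillis;  // how long to wait before returning

    DelayedExitRunnable(Runnable work, long graceMillis) {
        this.work = work;
        this.graceMillis = graceMillis;
    }

    @Override
    public void run() {
        work.run();
        try {
            // A value of 0 returns immediately, matching the "sleep of 0" case.
            TimeUnit.MILLISECONDS.sleep(graceMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and return right away.
            Thread.currentThread().interrupt();
        }
    }
}
```

The sleep only widens the window for the final flush; it doesn't guarantee that every message is delivered.
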
On Mon, May 12, 2014 at 10:39 AM, Terence Yim <[email protected]> wrote:
> Hi Erick,
>
> So all logs from node3 are entirely missing from the driver view?
> That's sounds something wrong to me.
>
> In Twill, it'll always try to flush the last bit of logs when a
> container shutdown, however if that flush failed, there would be no
> retry, since we don't want to hang the stopping of application.
>
> However in your case, it seems like you get no log from a particular
> node. Have you try to add couple seconds sleep in your Runnable.run()
> before return and see if it is due to shutdown issue or something
> else?
>
> Terence
>
> On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar
> <[email protected]> wrote:
> > Good evening,
> >
> > I'm not quite sure if this is a bug in my code or Twill. I've been working
> > away on TWILL-78, but I'm running into some basic issues with me not
> > seeming to receive all the log messages from my containers. You can find a
> > copy of my toy application here:
> > https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
> > twill with 3 containers. Occasionally I get everything I expect, but
> > sometimes I only seem to get a subset of the log message on the worker
> > nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on
> > the nodes have:
> >
> > node1:
> > ...
> > Launching main: public static void org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) throws java.lang.Exception []
> > 2014-05-10 00:20:09,334 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - entering barrier
> > 2014-05-10 00:20:09,352 - WARN [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are no ConnectionStateListeners registered.
> > 2014-05-10 00:20:11,187 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in barrier
> > 2014-05-10 00:20:11,188 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> > 2014-05-10 00:20:11,830 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of barrier
> > 2014-05-10 00:20:11,831 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681206035-0
> > ----
> >
> > node2:
> > ...
> > Launching main: public static void org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) throws java.lang.Exception []
> > 2014-05-10 00:20:00,133 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - entering barrier
> > 2014-05-10 00:20:00,158 - WARN [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are no ConnectionStateListeners registered.
> > 2014-05-10 00:20:02,161 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in barrier
> > 2014-05-10 00:20:02,161 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> > 2014-05-10 00:20:02,979 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of barrier
> > 2014-05-10 00:20:02,979 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681196232-0
> > ----
> >
> > node3:
> > ...
> > Launching main: public static void org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) throws java.lang.Exception []
> > 2014-05-10 00:20:01,191 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - entering barrier
> > 2014-05-10 00:20:01,197 - WARN [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are no ConnectionStateListeners registered.
> > 2014-05-10 00:20:02,768 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in barrier
> > 2014-05-10 00:20:02,769 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> > 2014-05-10 00:20:03,587 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of barrier
> > 2014-05-10 00:20:03,588 - ERROR [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681197698-0
> > ----
> >
> > But the driver script in this case only shows the output from 2 nodes:
> >
> > ----
> > 2014-05-09 17:19:15,137 - WARN [main:o.a.h.u.NativeCodeLoader@62] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] - before getting completion
> > 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) - entering barrier
> > 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) - entering barrier
> > 2014-05-10T00:20:09,352Z WARN o.a.c.f.s.ConnectionStateManager [node2] [ConnectionStateManager-0] ConnectionStateManager:processEvents(ConnectionStateManager.java:212) - There are no ConnectionStateListeners registered.
> > 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in barrier
> > 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) - woken up
> > 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out of barrier
> > 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
> > 2014-05-10T00:20:00,158Z WARN o.a.c.f.s.ConnectionStateManager [node1] [ConnectionStateManager-0] ConnectionStateManager:processEvents(ConnectionStateManager.java:212) - There are no ConnectionStateListeners registered.
> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in barrier
> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) - woken up
> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out of barrier
> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1] [ServiceDelegate] GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
> > 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket error: Connection reset by peer
> > 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] - after shutting down
> > 2014-05-09 17:20:13,312 - ERROR [Thread-3:o.l.g.t.GraphlabApplication$1@130] - shutting down
> > ---
> >
> > So is this something I'm doing wrong? Or is Twill or Kafka somehow shutting
> > down before all the messages have been sent?
> >
> > Thanks for any help,
> > -Erick
>
