Hi Erick, That sounds like bug to me that the Twill log appender is not doing proper flush when the Runnable terminates. I've created a ticket for the investigation and fix.
https://issues.apache.org/jira/browse/TWILL-80 Terence On Thu, May 15, 2014 at 3:58 PM, Erick Tryzelaar <[email protected]> wrote: > Hello Terence, > > Thanks for getting back to me. So it does seem to be some timing issue. > I've updated my gist (https://gist.github.com/erickt/7b16d695b64384015b41) > to add some extra controls and a driver script. If I do a sleep of `0`, it > will sometimes work, and sometimes exit before all the `"done"` messages > have been sent. However, If I do a sleep of `10`, it seems like I get all > the messages. > > > On Mon, May 12, 2014 at 10:39 AM, Terence Yim <[email protected]> wrote: > >> Hi Erick, >> >> So all logs from node3 are entirely missing from the driver view? >> That's sounds something wrong to me. >> >> In Twill, it'll always try to flush the last bit of logs when a >> container shutdown, however if that flush failed, there would be no >> retry, since we don't want to hang the stopping of application. >> >> However in your case, it seems like you get no log from a particular >> node. Have you try to add couple seconds sleep in your Runnable.run() >> before return and see if it is due to shutdown issue or something >> else? >> >> Terence >> >> On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar >> <[email protected]> wrote: >> > Good evening, >> > >> > I'm not quite sure if this is a bug in my code or Twill. I've been >> working >> > away on TWILL-78, but I'm running into some basic issues with me not >> > seeming to receive all the log messages from my containers. You can find >> a >> > copy of my toy application here: >> > https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing >> > twill with 3 containers. Occasionally I get everything I expect, but >> > sometimes I only seem to get a subset of the log message on the worker >> > nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on >> > the nodes have: >> > >> > node1: >> > ... >> > Launching main: public static void >> > >> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) >> > throws java.lang.Exception [] >> > 2014-05-10 00:20:09,334 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - >> > entering barrier >> > 2014-05-10 00:20:09,352 - WARN >> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There >> are >> > no ConnectionStateListeners registered. >> > 2014-05-10 00:20:11,187 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in >> > barrier >> > 2014-05-10 00:20:11,188 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - >> woken up >> > 2014-05-10 00:20:11,830 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out >> of >> > barrier >> > 2014-05-10 00:20:11,831 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done >> > Main class completed. >> > Launcher completed >> > Cleanup directory tmp/twill.launcher-1399681206035-0 >> > ---- >> > >> > node2: >> > ... >> > Launching main: public static void >> > >> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) >> > throws java.lang.Exception [] >> > 2014-05-10 00:20:00,133 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - >> > entering barrier >> > 2014-05-10 00:20:00,158 - WARN >> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There >> are >> > no ConnectionStateListeners registered. >> > 2014-05-10 00:20:02,161 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in >> > barrier >> > 2014-05-10 00:20:02,161 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - >> woken up >> > 2014-05-10 00:20:02,979 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out >> of >> > barrier >> > 2014-05-10 00:20:02,979 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done >> > Main class completed. >> > Launcher completed >> > Cleanup directory tmp/twill.launcher-1399681196232-0 >> > ---- >> > >> > node3: >> > ... >> > Launching main: public static void >> > >> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[]) >> > throws java.lang.Exception [] >> > 2014-05-10 00:20:01,191 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] - >> > entering barrier >> > 2014-05-10 00:20:01,197 - WARN >> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There >> are >> > no ConnectionStateListeners registered. >> > 2014-05-10 00:20:02,768 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in >> > barrier >> > 2014-05-10 00:20:02,769 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - >> woken up >> > 2014-05-10 00:20:03,587 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out >> of >> > barrier >> > 2014-05-10 00:20:03,588 - ERROR >> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done >> > Main class completed. >> > Launcher completed >> > Cleanup directory tmp/twill.launcher-1399681197698-0 >> > ---- >> > >> > But the driver script in this case only shows the output from 2 nodes: >> > >> > ---- >> > 2014-05-09 17:19:15,137 - WARN [main:o.a.h.u.NativeCodeLoader@62] - >> Unable >> > to load native-hadoop library for your platform... using builtin-java >> > classes where applicable >> > 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] - >> > before getting completion >> > 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) - >> > entering barrier >> > 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) - >> > entering barrier >> > 2014-05-10T00:20:09,352Z WARN o.a.c.f.s.ConnectionStateManager [node2] >> > [ConnectionStateManager-0] >> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) - >> > There are no ConnectionStateListeners registered. >> > 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - >> in >> > barrier >> > 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) - >> > woken up >> > 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - >> out >> > of barrier >> > 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - >> done >> > 2014-05-10T00:20:00,158Z WARN o.a.c.f.s.ConnectionStateManager [node1] >> > [ConnectionStateManager-0] >> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) - >> > There are no ConnectionStateListeners registered. >> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - >> in >> > barrier >> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) - >> > woken up >> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - >> out >> > of barrier >> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1] >> > [ServiceDelegate] >> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - >> done >> > 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket >> > error: Connection reset by peer >> > 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] - >> > after shutting down >> > 2014-05-09 17:20:13,312 - ERROR >> [Thread-3:o.l.g.t.GraphlabApplication$1@130] >> > - shutting down >> > --- >> > >> > So is this something I'm doing wrong? Or is Twill or Kafka somehow >> shutting >> > down before all the messages have been sent? >> > >> > Thanks for any help, >> > -Erick >>
