Hello Terence,

Thanks for getting back to me. So it does seem to be some timing issue.
I've updated my gist (https://gist.github.com/erickt/7b16d695b64384015b41)
to add some extra controls and a driver script. If I do a sleep of `0`, it
will sometimes work, and sometimes exit before all the `"done"` messages
have been sent. However, If I do a sleep of `10`, it seems like I get all
the messages.


On Mon, May 12, 2014 at 10:39 AM, Terence Yim <[email protected]> wrote:

> Hi Erick,
>
> So all logs from node3 are entirely missing from the driver view?
> That's sounds something wrong to me.
>
> In Twill, it'll always try to flush the last bit of logs when a
> container shutdown, however if that flush failed, there would be no
> retry, since we don't want to hang the stopping of application.
>
> However in your case, it seems like you get no log from a particular
> node. Have you try to add couple seconds sleep in your Runnable.run()
> before return and see if it is due to shutdown issue or something
> else?
>
> Terence
>
> On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar
> <[email protected]> wrote:
> > Good evening,
> >
> > I'm not quite sure if this is a bug in my code or Twill. I've been
> working
> > away on TWILL-78, but I'm running into some basic issues with me not
> > seeming to receive all the log messages from my containers. You can find
> a
> > copy of my toy application here:
> > https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
> > twill with 3 containers. Occasionally I get everything I expect, but
> > sometimes I only seem to get a subset of the log message on the worker
> > nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on
> > the nodes have:
> >
> > node1:
> > ...
> > Launching main: public static void
> >
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> > throws java.lang.Exception []
> > 2014-05-10 00:20:09,334 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> > entering barrier
> > 2014-05-10 00:20:09,352 - WARN
> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
> are
> > no ConnectionStateListeners registered.
> > 2014-05-10 00:20:11,187 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> > barrier
> > 2014-05-10 00:20:11,188 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
> woken up
> > 2014-05-10 00:20:11,830 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
> of
> > barrier
> > 2014-05-10 00:20:11,831 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681206035-0
> > ----
> >
> > node2:
> > ...
> > Launching main: public static void
> >
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> > throws java.lang.Exception []
> > 2014-05-10 00:20:00,133 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> > entering barrier
> > 2014-05-10 00:20:00,158 - WARN
> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
> are
> > no ConnectionStateListeners registered.
> > 2014-05-10 00:20:02,161 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> > barrier
> > 2014-05-10 00:20:02,161 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
> woken up
> > 2014-05-10 00:20:02,979 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
> of
> > barrier
> > 2014-05-10 00:20:02,979 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681196232-0
> > ----
> >
> > node3:
> > ...
> > Launching main: public static void
> >
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> > throws java.lang.Exception []
> > 2014-05-10 00:20:01,191 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> > entering barrier
> > 2014-05-10 00:20:01,197 - WARN
> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
> are
> > no ConnectionStateListeners registered.
> > 2014-05-10 00:20:02,768 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> > barrier
> > 2014-05-10 00:20:02,769 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
> woken up
> > 2014-05-10 00:20:03,587 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
> of
> > barrier
> > 2014-05-10 00:20:03,588 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681197698-0
> > ----
> >
> > But the driver script in this case only shows the output from 2 nodes:
> >
> > ----
> > 2014-05-09 17:19:15,137 - WARN  [main:o.a.h.u.NativeCodeLoader@62] -
> Unable
> > to load native-hadoop library for your platform... using builtin-java
> > classes where applicable
> > 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] -
> > before getting completion
> > 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> > entering barrier
> > 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> > entering barrier
> > 2014-05-10T00:20:09,352Z WARN  o.a.c.f.s.ConnectionStateManager [node2]
> > [ConnectionStateManager-0]
> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> > There are no ConnectionStateListeners registered.
> > 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
> in
> > barrier
> > 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> > woken up
> > 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
> out
> > of barrier
> > 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
> done
> > 2014-05-10T00:20:00,158Z WARN  o.a.c.f.s.ConnectionStateManager [node1]
> > [ConnectionStateManager-0]
> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> > There are no ConnectionStateListeners registered.
> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
> in
> > barrier
> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> > woken up
> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
> out
> > of barrier
> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
> done
> > 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket
> > error: Connection reset by peer
> > 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] -
> > after shutting down
> > 2014-05-09 17:20:13,312 - ERROR
> [Thread-3:o.l.g.t.GraphlabApplication$1@130]
> > - shutting down
> > ---
> >
> > So is this something I'm doing wrong? Or is Twill or Kafka somehow
> shutting
> > down before all the messages have been sent?
> >
> > Thanks for any help,
> > -Erick
>

Reply via email to