Hey Zach, Cool. I've engaged with some devs to understand if SimpleConsumer might be swallowing the interrupt.
Cheers, Chris On 1/15/15 12:29 PM, "Zach Cox" <[email protected]> wrote: >Hi Chris, > >Here is an example of logs when I see proper shutdown: > >https://gist.githubusercontent.com/zcox/6ec8910bd3f18e36c1a2/raw/666eae245 >11490bf51a66e56fd90c794ea6b6282/stdout > >That does show the InterruptedException "Got interrupt exception in broker >proxy thread." line. > >Here's java -version output on the Samza container: > >java version "1.6.0_33" OpenJDK Runtime Environment (IcedTea6 1.13.5) >(6b33-1.13.5-1ubuntu0.14.04) OpenJDK 64-Bit Server VM (build 23.25-b01, >mixed mode) > >Currently all of this stuff (Mesos, Kafka, Samza, etc) is running on my >laptop. > >I've just been using the released Samza 0.8.0 jars from maven central - >will try to get Samza building locally and add that tight interrupt loop, >and see what happens. > >Thanks, >Zach > > >On Thu, Jan 15, 2015 at 1:11 PM, Chris Riccomini < >[email protected]> wrote: > >> Hey Zach, >> >> Can you try modifying this line in BrokerProxy: >> >> >> >>https://github.com/apache/incubator-samza/blob/0.8.0/samza-kafka/src/main >>/s >> cala/org/apache/samza/system/kafka/BrokerProxy.scala#L283 >> >> To interrupt a lot in a tight loop and continue checking the thread >> status? Something like: >> >> while(thread.isAlive) { >> thread.interrupt >> } >> >> I want to see if we can catch the thread at a point when it's sleeping, >> not doing IO on the wire. If we can, the thread should shutdown cleanly, >> and we should see the "Got interrupt exception in broker proxy thread." >> exception. >> >> Cheers, >> Chris >> >> On 1/15/15 11:01 AM, "Chris Riccomini" <[email protected]> wrote: >> >> >Hey Zach, >> > >> >I'm beginning to wonder if this line is the problem: >> > >> > >> > >> >>https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/cons >>u >> >m >> >er/SimpleConsumer.scala#L75 >> > >> >If the sendRequest in SimpleConsumer catches a Throwable, it will catch >> >ClosedByInterruptException. This will cause the interrupt flag to get >> >reset, I think. If the consumer the retries the send, and it succeeds, >> >then I think it will continue on and ignore the interrupt. Going to >>poke a >> >Kafka dev to see if I am misunderstanding something. >> > >> >Cheers, >> >Chris >> > >> >On 1/15/15 10:52 AM, "Chris Riccomini" <[email protected]> wrote: >> > >> >>Hey Zach, >> >> >> >>Hmm. It sounds like something is catching the InterruptedException, >>and >> >>letting the interrupt flag get reset. If that happens, the >> >>ExponentialSleepStrategy [1] will continue running. >> >> >> >>I walked through the code, but couldn't find where it might be doing >>so, >> >>but it's hard to catch these without a test. >> >> >> >>> I have never seen that "Restarting consumer due to ..." warn line in >> >>>the >> >>>logs. >> >> >> >>If that's the case, then I don't think the BrokerProxy can be >> >>reconnecting >> >>since reconnect must be set to true, which would require seeing that >>log >> >>line [2]. >> >> >> >>Out of curiosity, what version of the JVM are you using? Trying to >> >>replicate this is proving tricky. >> >> >> >>I'm beginning to wonder if this has to do with InterruptedException >>vs. >> >>ClosedByInterruptException. In TestStatefulTask, I only see this the >> >>first >> >>log line: >> >> >> >> case e: InterruptedException => info("Got interrupt exception in >>broker >> >>proxy thread.") >> >> case e: ClosedByInterruptException => info("Got closed by interrupt >> >>exception in broker proxy thread.") >> >> >> >>This is probably because the test connects to a local Kafka broker, so >> >>nio >> >>socket operations are extremely fast, and the test almost always gets >> >>interrupted on a Thread.sleep, not a blocking socket operation. I'm >> >>wondering if a ClosedByInterruptException exception would cause the >> >>hanging that you're seeing. >> >> >> >> >> >>Out of curiosity, is your container on the same machine as your >>broker, >> >>or >> >>a different machine? >> >> >> >>Cheers, >> >>Chris >> >> >> >>[1] >> >> >> >>https://github.com/apache/incubator-samza/blob/0.8.0/samza-core/src/main/ >> >>s >> >>c >> >>ala/org/apache/samza/util/ExponentialSleepStrategy.scala#L80 >> >>[2] >> >> >> >>https://github.com/apache/incubator-samza/blob/0.8.0/samza-kafka/src/main >> >>/ >> >>s >> >>cala/org/apache/samza/system/kafka/BrokerProxy.scala#L133 >> >> >> >>On 1/15/15 9:38 AM, "Zach Cox" <[email protected]> wrote: >> >> >> >>>Hi Chris, >> >>> >> >>> >> >>>> Looking at your logs (from previous email, granted, different >> >>>>execution), >> >>>> I see that we get as far as SimpleConsumer.disconnect(): >> >>>> >> >>>> [DEBUG] [2015-01-14 17:19:12,596] [SAMZA-BROKER-PROXY-BrokerProxy >> >>>>...] >> >>>> o.a.s.s.k.DefaultFetchSimpleConsumer: Disconnecting from >> >>>>localhost:9192 >> >>>> >> >>>> This is invoked by BrokerProxy.scala:135. The only way this should >>get >> >>>> invoked is if an exception is caught. Do you see this log line >> >>>>anywhere? >> >>>> >> >>>> warn("Restarting consumer due to %s. Turn on debugging to get a >>full >> >>>> stack trace." format exception) >> >>>> >> >>> >> >>>DefaultFetchConsumer.close() calls SimpleConsumer.close(), which >>calls >> >>>disconnect(), which also logs that "Disconnecting from >>localhost:9192" >> >>>line: >> >>> >> >>> >> https://github.com/apache/incubator-samza/blob/master/samza-kafka/src/ma >> >>>i >> >>>n >> >>>>>/scala/org/apache/samza/system/kafka/DefaultFetchSimpleConsumer.scala# >>>>>L4 >> >>>4 >> >>> >> >>> >> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/con >> >>>s >> >>>u >> >>>mer/SimpleConsumer.scala#L50 >> >>> >> >>>I have never seen that "Restarting consumer due to ..." warn line in >>the >> >>>logs. >> >>> >> >>>Also note that I'm using Samza 0.8.0, not master. >> >>> >> >>> >> >>>> I ran our TestStatefulTask test a few times, and verified that the >> >>>> consumer CAN be shutdown: >> >>>> >> >>>> [ThreadJob] INFO org.apache.samza.container.SamzaContainer - >> >>>>Shutting >> >>>> down consumer multiplexer. >> >>>> [ThreadJob] INFO org.apache.samza.system.kafka.BrokerProxy - >> >>>>Shutting >> >>>> down BrokerProxy for localhost:59075 >> >>>> 2015-01-15 08:59:26 DefaultFetchSimpleConsumer [WARN] Reconnect >>due >> >>>>to >> >>>> socket error: null >> >>>> [SAMZA-BROKER-PROXY-BrokerProxy thread pointed at ...] INFO >> >>>> org.apache.samza.system.kafka.BrokerProxy - Got closed by interrupt >> >>>> exception in broker proxy thread. >> >>>> [SAMZA-BROKER-PROXY-BrokerProxy thread pointed at ...] INFO >> >>>> org.apache.samza.system.kafka.BrokerProxy - Shutting down due to >> >>>>interrupt. >> >>>> // consumer shutdown is complete here.. The container has moved >>on >> >>>>to >> >>>> shutting down the producer >> >>>> [ThreadJob] INFO org.apache.samza.container.SamzaContainer - >> >>>>Shutting >> >>>> down producer multiplexer. >> >>>> >> >>> >> >>>Yesterday afternoon I was showing this to a colleague, and we did >> >>>observe >> >>>a >> >>>proper clean shutdown once. Then the next 2 times I showed him we saw >> >>>the >> >>>BrokerProxy blocking issue. So I also observe proper shutdown >>sometimes >> >>>too, just seems a lot more often that BrokerProxy blocks and it >>doesn't >> >>>shut down. >> >>> >> >>>Thanks, >> >>>Zach >> >>> >> >>> >> >>> >> >>> >> >>>> >> >>>> >> >>>> Cheers, >> >>>> Chris >> >>>> >> >>>> On 1/14/15 7:24 PM, "Zach Cox" <[email protected]> wrote: >> >>>> >> >>>> >Hi Chris, >> >>>> > >> >>>> >I did a thread dump during the shutdown process, when I could >>tell it >> >>>> >wasn't shutting down properly [1]. You can see this thread dump in >> >>>>the >> >>>> >context of the other logging at shutdown. >> >>>> > >> >>>> >The main thread is indeed blocked on the Thread.join call in >> >>>> >BrokerProxy.stop. The BrokerProxy thread looks like it's still >> >>>>consuming >> >>>> >from Kafka? >> >>>> > >> >>>> >The thread dump was a bit tricky; this Samza container is running >>in >> >>>>a >> >>>> >Docker container on boot2docker, which is Tiny Core Linux, i.e. no >> >>>>jstack. >> >>>> >I never new this, but `sudo kill -3 [pid]` tells the jvm to do a >> >>>>thread >> >>>> >dump to stdout [2] :) >> >>>> > >> >>>> >Thanks, >> >>>> >Zach >> >>>> > >> >>>> >[1] >> >>>> > >> >>>> >> >>>> >> https://gist.githubusercontent.com/zcox/bb47a61d4ae1acd54056/raw/83224d >> >>>>2 >> >>>>0 >> >>>>a >> >>>> >61de20bb47bcf916d92930a71cd97ad/gistfile1.txt >> >>>> > >> >>>> >[2] >> >>>> > >> >>>> >> >>>> >> http://serverfault.com/questions/574745/jstack-alternative-for-linux-re >> >>>>d >> >>>>h >> >>>>a >> >>>> >t-6 >> >>>> > >> >>>> > >> >>>> >On Wed, Jan 14, 2015 at 7:30 PM, Chris Riccomini < >> >>>> >[email protected]> wrote: >> >>>> > >> >>>> >> Hey Zach, >> >>>> >> >> >>>> >> It sounds likely that the BrokerProxy thread is blocking or >> >>>>improperly >> >>>> >> catching an interrupt exception. >> >>>> >> >> >>>> >> Can you take a thread dump? My guess is that you'll see that the >> >>>>main >> >>>> >> thread is blocked on BrokerProxy.stop: >> >>>> >> >> >>>> >> thread.interrupt >> >>>> >> thread.join >> >>>> >> >> >>>> >> >> >>>> >> It'll likely be blocked on thread.join. If that's the case, I'd >> >>>>like >> >>>>to >> >>>> >> see what the BrokerProxy thread is doing. This line indicates >>that >> >>>>the >> >>>> >> thread seems to be shutting down: >> >>>> >> >> >>>> >> [DEBUG] [2015-01-14 17:19:12,596] >>[SAMZA-BROKER-PROXY-BrokerProxy >> >>>>thread >> >>>> >> pointed at localhost:9192 for client >> >>>> >> samza_consumer-twitter_message_separation-1-1421255914862-1] >> >>>> >> o.a.s.s.k.DefaultFetchSimpleConsumer: Disconnecting from >> >>>>localhost:9192 >> >>>> >> >> >>>> >> But if thread.join is blocking, then the shutdown never seems to >> >>>> >>complete. >> >>>> >> It'd be good to see where these threads are at. >> >>>> >> >> >>>> >> >> >>>> >> Cheers, >> >>>> >> Chris >> >>>> >> >> >>>> >> On 1/14/15 10:49 AM, "Zach Cox" <[email protected]> wrote: >> >>>> >> >> >>>> >> >Hi - related to the discussion yesterday about graceful >>shutdown >> >>>>[1], >> >>>> >> >today >> >>>> >> >I can't seem to get the SamzaContainer to actually shut down. >> >>>> >>Yesterday I >> >>>> >> >was seeing nice, fast, complete shutdown logs [2]. However, >>today >> >>>>the >> >>>> >>last >> >>>> >> >log line I see is related to shutting down BrokerProxy, then >> >>>>nothing >> >>>> >>else >> >>>> >> >happens until the container is finally just SIGKILLed by Docker >> >>>>[3]. >> >>>> >> > >> >>>> >> >The consistency with which clean, fast shutdowns were happening >> >>>> >>yesterday >> >>>> >> >and now they are never happening today leads me to believe I've >> >>>>changed >> >>>> >> >something, but I really can't find anything that would lead to >> >>>>this. >> >>>> >>It's >> >>>> >> >almost like there's some deadlock in BrokerProxy, but I would >> >>>>think >> >>>> >>that >> >>>> >> >would show up randomly - I've tried this about 100 times today, >> >>>>and >> >>>> >>every >> >>>> >> >time the logs are as shown in [3]. Today I also see 2 >>exceptions >> >>>>in >> >>>>the >> >>>> >> >Kafka broker logs [4] but I don't know if those were occurring >> >>>> >>yesterday >> >>>> >> >when Samza was cleanly shutting down, or not. >> >>>> >> > >> >>>> >> >One thing that doesn't seem to be happening today is this log >>line >> >>>>from >> >>>> >> >BrokerProxy: "Got interrupt exception in broker proxy thread." >> >>>> >> > >> >>>> >> >Has anyone seen this, or have any advice on what to try next? >> >>>> >> > >> >>>> >> >Thanks, >> >>>> >> >Zach >> >>>> >> > >> >>>> >> > >> >>>> >> >[1] >> >>>> >> >> >>>> >>> >> >>>> >> >>>> >> http://www.mail-archive.com/[email protected]/msg02246.htm >> >>>>l >> >>>> >> > >> >>>> >> >[2] >> >>>> >> > >> >>>> >> >> >>>> >> >> >>>> >> >>>> >> https://gist.githubusercontent.com/zcox/6ec8910bd3f18e36c1a2/raw/666eae >> >>>>2 >> >>>>4 >> >>>> >>5 >> >>>> >> >11490bf51a66e56fd90c794ea6b6282/stdout >> >>>> >> > >> >>>> >> >[3] >> >>>> >> > >> >>>> >> >> >>>> >> >> >>>> >> >>>> >> https://gist.githubusercontent.com/zcox/673a2ba607c566de7650/raw/d29ea8 >> >>>>b >> >>>>9 >> >>>> >>9 >> >>>> >> >868da9cc4dc1db198315ee1d03bc694/gistfile1.txt >> >>>> >> > >> >>>> >> >[4] >> >>>> >> > >> >>>> >> >> >>>> >> >> >>>> >> >>>> >> https://gist.githubusercontent.com/zcox/f08de55c5d5fe2d70cde/raw/c4be45 >> >>>>3 >> >>>>b >> >>>> >>5 >> >>>> >> >f6a57cfb5acb08508667f4b8ce8c2bd/gistfile1.txt >> >>>> >> >> >>>> >> >> >>>> >> >>>> >> >> >> > >> >>
