Hi Bobby,

Just an update on the stuck Trident transactional topology issue -- I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't reproduce the bug anymore. Will keep you posted if any issues arise.
Regards,
Danijel

On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I have not seen this before, if you could file a JIRA on this that would be great.
>
> - Bobby
>
> From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Date: Wednesday, June 4, 2014 at 10:30 AM
> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>, "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
> Subject: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>
> Hi all,
>
> I've managed to reproduce the stuck topology problem and it seems it's due to the Netty transport. I'm running with the ZMQ transport enabled now and haven't been able to reproduce this.
>
> The problem is basically a Trident/Kafka transactional topology getting stuck, i.e. re-emitting the same batches over and over again. This happens after the Storm workers restart a few times due to the Kafka spout throwing RuntimeExceptions (because the Kafka consumer in the spout times out with a SocketTimeoutException due to some temporary network problems). Sometimes the topology gets stuck after just one worker restart, and sometimes a few worker restarts are needed to trigger the problem.
>
> I simulated the Kafka spout socket timeouts by blocking network access from Storm to my Kafka machines (with an iptables firewall rule). Most of the time the spouts (workers) would restart normally (after re-enabling access to Kafka) and the topology would continue to process batches, but sometimes the topology would get stuck re-emitting batches after the crashed workers restarted. Killing and re-submitting the topology manually always fixes this, and processing continues normally.
>
> I haven't been able to reproduce this scenario after reverting my Storm cluster's transport to ZeroMQ. With the Netty transport, I can almost always reproduce the problem by causing a worker to restart a number of times (only about 4-5 worker restarts are enough to trigger it).
>
> Any hints on this? Has anyone had the same problem? It does seem a serious issue, as it affects the reliability and fault tolerance of the Storm cluster.
>
> In the meantime, I'll try to prepare a reproducible test case for this.
>
> Thanks,
>
> Danijel
>
>
> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:
>
>> To (partially) answer my own question -- I still have no idea on the cause of the stuck topology, but re-submitting the topology helps -- after re-submitting, my topology is now running normally.
>>
>>
>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:
>>
>>> Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the IBackingMap implementation as my strategy in rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).
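As an aside, the deadlock handling mentioned above looks roughly like the following. This is a simplified sketch rather than my actual SQL-backed IBackingMap code -- the class name, table and columns are purely illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

import storm.trident.state.map.IBackingMap;

// Simplified sketch of the "fail the worker on SQL deadlock" strategy.
public class DeadlockFailingBackingMap implements IBackingMap<Long> {

    private final Connection connection; // assumed to be created elsewhere

    public DeadlockFailingBackingMap(Connection connection) {
        this.connection = connection;
    }

    @Override
    public List<Long> multiGet(List<List<Object>> keys) {
        // Reads are elided in this sketch; a real implementation SELECTs the
        // current value for every key (null where no row exists yet).
        List<Long> vals = new ArrayList<Long>();
        for (int i = 0; i < keys.size(); i++) {
            vals.add(null);
        }
        return vals;
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<Long> vals) {
        try {
            // Illustrative table/columns -- write every (key, value) pair of the batch.
            PreparedStatement stmt = connection.prepareStatement(
                    "UPDATE counts SET value = ? WHERE key = ?");
            for (int i = 0; i < keys.size(); i++) {
                stmt.setLong(1, vals.get(i));
                stmt.setString(2, keys.get(i).get(0).toString());
                stmt.addBatch();
            }
            stmt.executeBatch();
            stmt.close();
        } catch (SQLException e) {
            // A deadlock (or any other SQL error) surfaces here. Rethrowing it as a
            // RuntimeException kills the worker; Storm restarts the worker and
            // Trident re-emits and retries the batch.
            throw new RuntimeException("SQL write failed, failing the batch so it is retried", e);
        }
    }
}

The only point of interest is that any SQLException (deadlocks included) escapes multiPut() as a RuntimeException, so the worker dies, Storm restarts it, and Trident re-emits and retries the batch.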
>>> From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>
>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736
>>>
>>> This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a case of a batch retrying indefinitely. This is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.
>>>
>>> I can see in my IBackingMap's backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap, since there are two BackingMap instances running (parallelismHint 2).
>>>
>>> However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.
>>>
>>> Any suggestions on the area (topology component) to focus my research on?
>>>
>>> Thanks,
>>>
>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted and no tuples being persisted by the TridentState (IBackingMap).
>>>>
>>>> It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.
>>>>
>>>> In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>
>>>> "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"
>>>>
>>>> ... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing -- it seems the batch isn't being acked by Trident, and I have no idea why, especially since the topology had been running successfully for the last 20 days.
>>>>
>>>> I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.
>>>>
>>>> Any hints on how to debug the stuck topology? Any other useful info I might provide?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: dani...@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijel.schiavuzzi
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: dani...@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijel.schiavuzzi
>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: dani...@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7
>
>
> --
> Danijel Schiavuzzi
>
> E: dani...@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7

--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7
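P.S. For reference, the topology discussed in the quoted thread is wired up roughly as follows. This is a simplified, self-contained sketch -- the Zookeeper host, topic and stream names are illustrative, and an in-memory MemoryMapState stands in for the actual trident-mssql transactional state factory:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.TransactionalTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

public class KafkaTridentSketch {
    public static void main(String[] args) throws Exception {
        // Kafka spout configuration (host and topic names are illustrative).
        TridentKafkaConfig spoutConfig =
                new TridentKafkaConfig(new ZkHosts("zk-host:2181"), "my-topic");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        TransactionalTridentKafkaSpout spout =
                new TransactionalTridentKafkaSpout(spoutConfig);

        TridentTopology topology = new TridentTopology();
        topology.newStream("kafka-spout", spout)
                // The real topology parses the message and groups on a proper
                // key; here we simply group on the raw message string.
                .groupBy(new Fields("str"))
                // In the real topology the state factory is the trident-mssql
                // transactional one; an in-memory stand-in is used here.
                .persistentAggregate(new MemoryMapState.Factory(),
                        new Count(), new Fields("count"))
                .parallelismHint(2);

        new LocalCluster().submitTopology("sketch", new Config(), topology.build());
    }
}

In the real topology the persistentAggregate() writes go through the SQL-backed IBackingMap with parallelismHint(2), which is why two BackingMap instances show up when checking the committed txids in the database.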