Hi Bobby,

Just an update on the stuck Trident transactional topology issue -- I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't reproduce the bug anymore. Will keep you posted if any issues arise.
Regards,
Danijel

On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I have not seen this before, if you could file a JIRA on this that would be great.
>
> - Bobby
>
> From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Date: Wednesday, June 4, 2014 at 10:30 AM
> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>, "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
> Subject: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>
> Hi all,
>
> I've managed to reproduce the stuck topology problem and it seems it's due to the Netty transport. I'm running with the ZMQ transport enabled now and haven't been able to reproduce this.
>
> The problem is basically a Trident/Kafka transactional topology getting stuck, i.e. re-emitting the same batches over and over again. This happens after the Storm workers restart a few times due to the Kafka spout throwing RuntimeExceptions (because the Kafka consumer in the spout times out with a SocketTimeoutException due to some temporary network problems). Sometimes the topology gets stuck after just one worker restart, and sometimes a few worker restarts are needed to trigger the problem.
>
> I simulated the Kafka spout socket timeouts by blocking network access from Storm to my Kafka machines (with an iptables firewall rule). Most of the time the spouts (workers) would restart normally (after re-enabling access to Kafka) and the topology would continue to process batches, but sometimes the topology would get stuck re-emitting batches after the crashed workers restarted. Killing and re-submitting the topology manually always fixes this, and processing continues normally.
>
> I haven't been able to reproduce this scenario after reverting my Storm cluster's transport to ZeroMQ. With the Netty transport, I can almost always reproduce the problem by causing a worker to restart a number of times (only about 4-5 worker restarts are enough to trigger it).
>
> Any hints on this? Has anyone had the same problem? It does seem a serious issue, as it affects the reliability and fault tolerance of the Storm cluster.
>
> In the meantime, I'll try to prepare a reproducible test case for this.
>
> Thanks,
>
> Danijel
>
>
> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:
>
>> To (partially) answer my own question -- I still have no idea on the cause of the stuck topology, but re-submitting the topology helps -- after re-submitting, my topology is now running normally.
>>
>>
>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:
>>
>>> Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the IBackingMap implementation as my strategy in rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).
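As an aside, the deadlock handling mentioned above looks roughly like the following. This is a simplified sketch rather than my actual SQL-backed IBackingMap code -- the class name, table and columns are purely illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

import storm.trident.state.map.IBackingMap;

// Simplified sketch of the "fail the worker on SQL deadlock" strategy.
public class DeadlockFailingBackingMap implements IBackingMap<Long> {

    private final Connection connection; // assumed to be created elsewhere

    public DeadlockFailingBackingMap(Connection connection) {
        this.connection = connection;
    }

    @Override
    public List<Long> multiGet(List<List<Object>> keys) {
        // Reads are elided in this sketch; a real implementation SELECTs the
        // current value for every key (null where no row exists yet).
        List<Long> vals = new ArrayList<Long>();
        for (int i = 0; i < keys.size(); i++) {
            vals.add(null);
        }
        return vals;
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<Long> vals) {
        try {
            // Illustrative table/columns -- write every (key, value) pair of the batch.
            PreparedStatement stmt = connection.prepareStatement(
                    "UPDATE counts SET value = ? WHERE key = ?");
            for (int i = 0; i < keys.size(); i++) {
                stmt.setLong(1, vals.get(i));
                stmt.setString(2, keys.get(i).get(0).toString());
                stmt.addBatch();
            }
            stmt.executeBatch();
            stmt.close();
        } catch (SQLException e) {
            // A deadlock (or any other SQL error) surfaces here. Rethrowing it as a
            // RuntimeException kills the worker; Storm restarts the worker and
            // Trident re-emits and retries the batch.
            throw new RuntimeException("SQL write failed, failing the batch so it is retried", e);
        }
    }
}

The only point of interest is that any SQLException (deadlocks included) escapes multiPut() as a RuntimeException, so the worker dies, Storm restarts it, and Trident re-emits and retries the batch.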
>>> From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>
>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736
>>>
>>> This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a case of a batch retrying indefinitely. This is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.
>>>
>>> I can see in my IBackingMap's backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap, since there are two BackingMap instances running (parallelismHint 2).
>>>
>>> However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.
>>>
>>> Any suggestions on the area (topology component) to focus my research on?
>>>
>>> Thanks,
>>>
>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted and no tuples being persisted by the TridentState (IBackingMap).
>>>>
>>>> It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.
>>>>
>>>> In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>
>>>> "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"
>>>>
>>>> ... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing -- it seems the batch isn't being acked by Trident, and I have no idea why, especially since the topology had been running successfully for the last 20 days.
>>>>
>>>> I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.
>>>>
>>>> Any hints on how to debug the stuck topology? Any other useful info I might provide?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: dani...@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijel.schiavuzzi
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: dani...@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijel.schiavuzzi
>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: dani...@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7
>
>
> --
> Danijel Schiavuzzi
>
> E: dani...@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7

--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7
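P.S. For reference, the topology discussed in the quoted thread is wired up roughly as follows. This is a simplified, self-contained sketch -- the Zookeeper host, topic and stream names are illustrative, and an in-memory MemoryMapState stands in for the actual trident-mssql transactional state factory:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.TransactionalTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

public class KafkaTridentSketch {
    public static void main(String[] args) throws Exception {
        // Kafka spout configuration (host and topic names are illustrative).
        TridentKafkaConfig spoutConfig =
                new TridentKafkaConfig(new ZkHosts("zk-host:2181"), "my-topic");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        TransactionalTridentKafkaSpout spout =
                new TransactionalTridentKafkaSpout(spoutConfig);

        TridentTopology topology = new TridentTopology();
        topology.newStream("kafka-spout", spout)
                // The real topology parses the message and groups on a proper
                // key; here we simply group on the raw message string.
                .groupBy(new Fields("str"))
                // In the real topology the state factory is the trident-mssql
                // transactional one; an in-memory stand-in is used here.
                .persistentAggregate(new MemoryMapState.Factory(),
                        new Count(), new Fields("count"))
                .parallelismHint(2);

        new LocalCluster().submitTopology("sketch", new Config(), topology.build());
    }
}

In the real topology the persistentAggregate() writes go through the SQL-backed IBackingMap with parallelismHint(2), which is why two BackingMap instances show up when checking the committed txids in the database.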