Hi Tarkeshwar,

Could you provide a code sample of your topology? Do you have any special
configs enabled?

Thanks,

Danijel


On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com>
wrote:

> Hi Danijel,
>
> We are able to reproduce this issue with 0.9.2 as well.
> We have a two-worker setup to run the Trident topology.
>
> When we kill one of the workers and the killed worker respawns on the
> same port (same slot), that worker is not able to communicate with the
> second worker.
>
> Only the transaction attempt count keeps increasing.
>
> But if the killed worker respawns on a new slot (a new communication
> port), it works fine. Same behavior as in Storm 0.9.0.1.
>
> Please update me if there are any new developments.
>
> Regards
> Tarkeshwar
>
>
> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com
> > wrote:
>
>> Hi Bobby,
>>
>> Just an update on the stuck Trident transactional topology issue -- I've
>> upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>
>> Regards,
>>
>> Danijel
>>
>>
>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:
>>
>>> I have not seen this before. If you could file a JIRA on this, that
>>> would be great.
>>>
>>>  - Bobby
>>>
>>>   From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
>>> Reply-To: "user@storm.incubator.apache.org" <
>>> user@storm.incubator.apache.org>
>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>,
>>> "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
>>> Subject: Trident transactional topology stuck re-emitting batches with
>>> Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>
>>>   Hi all,
>>>
>>> I've managed to reproduce the stuck topology problem, and it seems it's
>>> due to the Netty transport. I'm running with the ZMQ transport enabled
>>> now and haven't been able to reproduce it.
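>>>
>>> For reference, the transport is switched via storm.yaml -- these are
>>> the stock transport settings in Storm 0.9.x, but double-check the
>>> class names against your distribution:
>>>
>>>     # Netty transport (the setup that triggers the stuck batches for us):
>>>     storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>
>>>     # ZeroMQ transport (the pre-0.9 default we reverted to):
>>>     # storm.messaging.transport: "backtype.storm.messaging.zmq"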
>>>
>>> The problem is basically a Trident/Kafka transactional topology getting
>>> stuck, i.e. re-emitting the same batches over and over again. This
>>> happens after the Storm workers restart a few times due to the Kafka
>>> spout throwing RuntimeExceptions (the Kafka consumer in the spout times
>>> out with a SocketTimeoutException during temporary network problems).
>>> Sometimes the topology is stuck after just one worker restart, and
>>> sometimes a few worker restarts are needed to trigger the problem.
>>>
>>> I simulated the Kafka spout socket timeouts by blocking network access
>>> from Storm to my Kafka machines (with an iptables firewall rule). Most of
>>> the time the spouts (workers) would restart normally (after re-enabling
>>> access to Kafka) and the topology would continue to process batches, but
>>> sometimes the topology would get stuck re-emitting batches after the
>>> crashed workers restarted. Manually killing and re-submitting the
>>> topology always fixes this, and processing continues normally.
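>>>
>>> (For the record, the firewall rule was a plain iptables DROP along
>>> these lines -- the broker address and port here are placeholders for
>>> my Kafka machines:
>>>
>>>     iptables -A OUTPUT -d <kafka-broker-ip> -p tcp --dport 9092 -j DROP
>>>
>>> and deleting it with the matching "iptables -D OUTPUT ..." restores
>>> access.)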
>>>
>>>  I haven't been able to reproduce this scenario after reverting my
>>> Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>> always reproduce the problem by causing a worker to restart a number of
>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>
>>> Any hints on this? Has anyone had the same problem? It does seem a
>>> serious issue, as it affects the reliability and fault tolerance of the
>>> Storm cluster.
>>>
>>>  In the meantime, I'll try to prepare a reproducible test case for this.
>>>
>>>  Thanks,
>>>
>>> Danijel
>>>
>>>
>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>> dani...@schiavuzzi.com> wrote:
>>>
>>>> To (partially) answer my own question -- I still have no idea about
>>>> the cause of the stuck topology, but re-submitting the topology helps
>>>> -- after re-submitting, my topology is now running normally.
>>>>
>>>>
>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>> dani...@schiavuzzi.com> wrote:
>>>>
>>>>> Also, I did have multiple cases of my IBackingMap workers dying
>>>>> (because of RuntimeExceptions) but successfully restarting afterwards.
>>>>> Throwing RuntimeExceptions from the IBackingMap implementation is my
>>>>> strategy for rare SQL database deadlocks: it forces a worker restart
>>>>> and fails the batch so that it is retried (a rough sketch follows
>>>>> below).
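>>>>>
>>>>> A sketch of that strategy with illustrative names -- only the
>>>>> IBackingMap interface and its two methods are the real Trident API,
>>>>> the SQL helper is hypothetical:
>>>>>
>>>>>     import java.sql.SQLException;
>>>>>     import java.util.ArrayList;
>>>>>     import java.util.List;
>>>>>     import storm.trident.state.map.IBackingMap;
>>>>>
>>>>>     public class SqlBackingMap implements IBackingMap<Long> {
>>>>>         @Override
>>>>>         public List<Long> multiGet(List<List<Object>> keys) {
>>>>>             // Read the current value for each key; null = no state yet.
>>>>>             List<Long> vals = new ArrayList<Long>();
>>>>>             for (int i = 0; i < keys.size(); i++) vals.add(null);
>>>>>             return vals;
>>>>>         }
>>>>>
>>>>>         @Override
>>>>>         public void multiPut(List<List<Object>> keys, List<Long> vals) {
>>>>>             try {
>>>>>                 writeBatchToSql(keys, vals); // hypothetical JDBC upsert
>>>>>             } catch (SQLException e) {
>>>>>                 // Deliberately crash the worker on a deadlock; Trident
>>>>>                 // fails the batch and re-emits it after the restart.
>>>>>                 throw new RuntimeException("SQL deadlock, failing batch", e);
>>>>>             }
>>>>>         }
>>>>>
>>>>>         private void writeBatchToSql(List<List<Object>> keys,
>>>>>                 List<Long> vals) throws SQLException {
>>>>>             // The actual JDBC batch write lives here.
>>>>>         }
>>>>>     }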
>>>>>
>>>>> From the logs, one such IBackingMap worker death (and subsequent
>>>>> restart) resulted in the Kafka spout re-emitting the pending batch:
>>>>>
>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>>>> batch, attempt 29698959:736
>>>>>
>>>>> This is of course the normal behavior of a transactional topology,
>>>>> but it's the first time I've encountered a batch being retried
>>>>> indefinitely. This is especially suspicious since the topology had
>>>>> been running fine for 20 days straight, re-emitting batches and
>>>>> restarting IBackingMap workers quite a number of times.
>>>>>
>>>>> I can see in my IBackingMap's backing SQL database that the batch with
>>>>> the exact txid value 29698959 has been committed -- but I suspect that
>>>>> could come from the other BackingMap, since there are two BackingMap
>>>>> instances running (parallelismHint 2).
>>>>>
>>>>> However, I have no idea why the batch is now being retried
>>>>> indefinitely, or why it hasn't been successfully acked by Trident.
>>>>>
>>>>> Any suggestions on which area (topology component) to focus my
>>>>> research?
>>>>>
>>>>>  Thanks,
>>>>>
>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>> dani...@schiavuzzi.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm having problems with my transactional Trident topology. It has
>>>>>> been running fine for about 20 days, and is suddenly stuck processing
>>>>>> a single batch, with no tuples being emitted and none persisted by
>>>>>> the TridentState (IBackingMap).
>>>>>>
>>>>>> It's a simple topology that consumes messages off a Kafka queue. The
>>>>>> spout is an instance of the storm-kafka-0.8-plus
>>>>>> TransactionalTridentKafkaSpout, and I use the trident-mssql
>>>>>> transactional TridentState implementation to persistentAggregate()
>>>>>> data into a SQL database.
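>>>>>>
>>>>>> In outline, the wiring looks roughly like this -- ParseMessage and
>>>>>> the field names are placeholders, not the actual code:
>>>>>>
>>>>>>     import backtype.storm.generated.StormTopology;
>>>>>>     import backtype.storm.tuple.Fields;
>>>>>>     import storm.kafka.trident.TransactionalTridentKafkaSpout;
>>>>>>     import storm.kafka.trident.TridentKafkaConfig;
>>>>>>     import storm.trident.TridentTopology;
>>>>>>     import storm.trident.operation.builtin.Count;
>>>>>>     import storm.trident.state.StateFactory;
>>>>>>
>>>>>>     // kafkaConfig and sqlStateFactory (the trident-mssql StateFactory)
>>>>>>     // are configured elsewhere; ParseMessage stands in for our parser.
>>>>>>     public static StormTopology buildTopology(TridentKafkaConfig kafkaConfig,
>>>>>>             StateFactory sqlStateFactory) {
>>>>>>         TridentTopology topology = new TridentTopology();
>>>>>>         topology.newStream("kafka-txn",
>>>>>>                     new TransactionalTridentKafkaSpout(kafkaConfig))
>>>>>>                 .each(new Fields("bytes"), new ParseMessage(), new Fields("key"))
>>>>>>                 .groupBy(new Fields("key"))
>>>>>>                 .persistentAggregate(sqlStateFactory, new Count(),
>>>>>>                         new Fields("count"));
>>>>>>         return topology.build();
>>>>>>     }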
>>>>>>
>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>
>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>>> "{"29698959":6487}"
>>>>>>
>>>>>> ... and the attempt count (6487 here) keeps increasing. The batch
>>>>>> with txid 29698959 seems to be stuck -- it looks like it isn't being
>>>>>> acked by Trident, and I have no idea why, especially since the
>>>>>> topology had been running successfully for the previous 20 days.
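>>>>>>
>>>>>> (For reference, that znode can be inspected with the stock ZooKeeper
>>>>>> CLI, e.g.:
>>>>>>
>>>>>>     zkCli.sh -server <zk-host>:2181
>>>>>>     get /transactional/<myTopologyName>/coordinator/currattempts
>>>>>>
>>>>>> where the JSON maps the txid to its current attempt number.)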
>>>>>>
>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>> continued running normally. Other than that, no other modifications were
>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>
>>>>>>  Any hints on how to debug the stuck topology? Any other useful info
>>>>>> I might provide?
>>>>>>
>>>>>>  Thanks,
>>>>>>
>>
>
>


-- 
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7
