I have not seen this before; if you could file a JIRA on this, that would be great.
- Bobby

From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
Date: Wednesday, June 4, 2014 at 10:30 AM
To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>, "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
Subject: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Hi all,

I've managed to reproduce the stuck-topology problem, and it seems to be due to the Netty transport. I'm running with the ZMQ transport enabled now and haven't been able to reproduce it since.

The problem is a Trident/Kafka transactional topology getting stuck, i.e. re-emitting the same batches over and over again. This happens after the Storm workers restart a few times because the Kafka spout throws RuntimeExceptions (the Kafka consumer in the spout times out with a SocketTimeoutException due to temporary network problems). Sometimes the topology gets stuck after a single worker restart; sometimes a few worker restarts are needed to trigger the problem.

I simulated the Kafka spout socket timeouts by blocking network access from Storm to my Kafka machines (with an iptables firewall rule). Most of the time the spouts (workers) would restart normally after I re-enabled access to Kafka, and the topology would continue to process batches; but sometimes the topology would get stuck re-emitting batches after the crashed workers restarted. Killing and re-submitting the topology manually always fixes this, and processing then continues normally.
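For context, the messaging transport is selected in storm.yaml. A sketch of the relevant settings for Storm 0.9.x follows; the two transport class names are the standard ones, while the Netty tuning values shown are illustrative and should be checked against your version's defaults:

```yaml
# storm.yaml -- messaging transport selection (Storm 0.9.x)

# Netty transport:
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
# Netty tuning keys (values here are illustrative, not recommendations):
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 30
storm.messaging.netty.min_wait_ms: 100
storm.messaging.netty.max_wait_ms: 1000

# To revert to the older ZeroMQ transport instead:
# storm.messaging.transport: "backtype.storm.messaging.zmq"
```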
I haven't been able to reproduce this scenario after reverting my Storm cluster's transport to ZeroMQ. With the Netty transport, I can almost always reproduce the problem by causing a worker to restart a number of times (only about 4-5 worker restarts are enough to trigger it).

Any hints on this? Has anyone had the same problem? It does seem like a serious issue, as it affects the reliability and fault tolerance of the Storm cluster. In the meantime, I'll try to prepare a reproducible test case.

Thanks,

Danijel

On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

To (partially) answer my own question -- I still have no idea about the cause of the stuck topology, but re-submitting the topology helps: after re-submitting, my topology is now running normally.

On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards. I throw RuntimeExceptions in my IBackingMap implementation as a strategy for rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch. From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:

2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736

This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a batch retrying indefinitely. It is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.
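The fail-and-retry strategy described above (rethrow a SQL deadlock as a RuntimeException so Storm restarts the worker and Trident fails and retries the batch) can be sketched roughly as follows. This is a hypothetical, self-contained illustration, not the actual trident-mssql or topology code; `multiPut` here is a stand-in for the IBackingMap method, and 1205 is SQL Server's deadlock-victim error code:

```java
import java.sql.SQLException;

public class DeadlockRetrySketch {

    // SQL Server reports a deadlock victim with vendor error code 1205.
    static final int SQLSERVER_DEADLOCK = 1205;

    // Stand-in for IBackingMap.multiPut(): persist a batch of
    // (txid, key, value) rows transactionally. On a deadlock we rethrow
    // as RuntimeException so the worker dies, restarts, and the whole
    // batch is failed and retried instead of being partially committed.
    static void multiPut(boolean simulateDeadlock) {
        try {
            if (simulateDeadlock) {
                // Simulated driver error; a real implementation would hit
                // this while executing its batched UPSERT statements.
                throw new SQLException("deadlock victim", "40001",
                                       SQLSERVER_DEADLOCK);
            }
            // ... normal batched UPSERT of the aggregated values here ...
        } catch (SQLException e) {
            if (e.getErrorCode() == SQLSERVER_DEADLOCK) {
                // Force a batch retry rather than committing partial state.
                throw new RuntimeException(
                    "SQL deadlock, failing batch for retry", e);
            }
            throw new RuntimeException(e);
        }
    }
}
```

The key design point is that transactional Trident state must never acknowledge a batch whose database write did not fully commit; crashing the worker is a blunt but safe way to guarantee that.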
I can see in my IBackingMap's backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could have come from the other BackingMap instance, since there are two BackingMap instances running (parallelismHint 2). However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident. Any suggestions on which area (topology component) to focus my research on?

Thanks,

On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hello,

I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted and no tuples being persisted by the TridentState (IBackingMap).

It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.

In Zookeeper I can see Storm is re-trying a batch, i.e. "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}" ... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck: the attempt count in Zookeeper keeps increasing, so the batch isn't being acked by Trident, and I have no idea why, especially since the topology had been running successfully for the last 20 days.

I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no modifications were made. Storm is at version 0.9.0.1.

Any hints on how to debug the stuck topology? Any other useful info I might provide?
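For anyone following along, the attempt counter above can be watched with the stock ZooKeeper CLI. A sketch, assuming zkCli.sh from the ZooKeeper distribution and a ZooKeeper server on localhost:2181 (substitute your topology's name for the placeholder):

```
# Read the per-txid attempt map kept by the transactional spout coordinator.
zkCli.sh -server localhost:2181 \
  get /transactional/<myTopologyName>/coordinator/currattempts

# The node's data is a JSON map of txid -> attempt count, e.g. {"29698959":6487}.
# A single txid whose attempt count keeps growing across reads means that
# batch is never being acked.
```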
Thanks,

--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijel.schiavuzzi