Thanks for the reply. Can I pull this fix, or can I download it?

On 17 Sep 2014 09:58, tarkeshwa...@gmail.com wrote:
In which version is it available?

On 16 Sep 2014 19:01, "Danijel Schiavuzzi" <dani...@schiavuzzi.com> wrote:

Yes, it's been fixed in 'master' for some time now.

Danijel

On Tuesday, September 16, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

Is the issue resolved in any version of Storm?

Regards
Tarkeshwar

On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I've filed a bug report for this under https://issues.apache.org/jira/browse/STORM-406

The issue is 100% reproducible with, it seems, any Trident topology, and across multiple Storm versions with the Netty transport enabled. 0MQ works fine. You can try with TridentWordCount from storm-starter, for example.

Your insight seems correct: when the killed worker re-spawns on the same slot (port), the topology stops processing. See the above JIRA for additional info.

Danijel

On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Thanks, Danijel, for helping me.

On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I see no issues with your cluster configuration.

You should definitely share the (simplified, if possible) topology code and the steps to reproduce the blockage; better yet, file a JIRA task on Apache's JIRA -- be sure to include your Trident internals modifications.

Unfortunately, it seems I'm having the same issues with Storm 0.9.2 too, so I might get back here with some updates soon. It's not as quickly and easily reproducible as it was under 0.9.1, but the bug nonetheless seems to still be present. I'll reduce the number of Storm slots and topology workers as per your insights; hopefully that will make it easier to reproduce the bug with a simplified Trident topology.

On Tuesday, July 15, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We have made a few changes to the Trident core framework code as per our needs, and these work fine with ZeroMQ. I am sharing the configuration we are using. Can you please check whether our config is fine?

The code is too large to share as is, so we are writing a sample topology to reproduce the issue, which we will share with you (a minimal sketch of such a topology follows the steps below).

Steps to reproduce the issue:

1. We deployed our topology on one Linux machine, with two workers, one acker, and a batch size of 2.
2. Both workers come up and start processing.
3. After a few seconds, I killed one of the workers with kill -9.
4. When the killed worker respawns on the same port, it hangs.
5. Only retries keep occurring.
6. When the killed worker respawns on another port, everything works fine.
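For reference, a minimal sample Trident topology along the lines of storm-starter's TridentWordCount is sketched below. This is only an illustrative stand-in for the sample topology mentioned above (the class name, data, and the two-worker setting are assumptions against the Storm 0.9.x API), not the actual topology code:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class TridentReproTopology {

        // Splits a sentence into words, emitting one tuple per word.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // A cycling in-memory spout stands in for the real data source.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"))
                    .parallelismHint(2);

            Config conf = new Config();
            conf.setNumWorkers(2); // two workers, as in the reproduction steps above
            StormSubmitter.submitTopology("trident-repro", conf, topology.build());
        }
    }

Submitting something like this to the two-worker cluster and then killing one worker with kill -9 (step 3 above) should exercise the same respawn-on-the-same-slot path.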
machine conf:

    [root@sb6270x1637-2 conf]# uname -a
    Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

storm.yaml which we are using to launch nimbus, supervisor and ui:

    ########## These MUST be filled in for a storm configuration
    storm.zookeeper.servers:
        - "10.61.244.86"
    storm.zookeeper.port: 2000
    supervisor.slots.ports:
        - 6788
        - 6789
        - 6800
        - 6801
        - 6802
        - 6803

    nimbus.host: "10.61.244.86"

    storm.messaging.transport: "backtype.storm.messaging.netty.Context"
    storm.messaging.netty.server_worker_threads: 10
    storm.messaging.netty.client_worker_threads: 10
    storm.messaging.netty.buffer_size: 5242880
    storm.messaging.netty.max_retries: 100
    storm.messaging.netty.max_wait_ms: 1000
    storm.messaging.netty.min_wait_ms: 100

    storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
    storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"

    topology.acker.executors: 1
    topology.message.timeout.secs: 30

    supervisor.scheduler.meta:
        name: "supervisor1"

    worker.childopts: "-Xmx2048m"

    mm.hdfs.ipaddress: "10.61.244.7"
    mm.hdfs.port: 9000
    topology.batch.size: 2
    topology.batch.timeout: 10000
    topology.workers: 2
    topology.debug: true

Regards
Tarkeshwar
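As a side note, the topology-level values in the storm.yaml above can also be set programmatically at submit time. A small sketch against the standard backtype.storm.Config API (the mm.* and topology.batch.* keys are application-specific, so they are only passed through as plain map entries):

    import backtype.storm.Config;

    public class ReproConfig {
        // Builds the same topology-level settings as in the storm.yaml above.
        public static Config build() {
            Config conf = new Config();
            conf.setNumWorkers(2);           // topology.workers: 2
            conf.setNumAckers(1);            // topology.acker.executors: 1
            conf.setMessageTimeoutSecs(30);  // topology.message.timeout.secs: 30
            conf.setDebug(true);             // topology.debug: true
            // Application-specific keys are plain map entries (Config extends HashMap):
            conf.put("topology.batch.size", 2);
            conf.put("topology.batch.timeout", 10000);
            return conf;
        }
    }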
On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Tarkeshwar,

Could you provide a code sample of your topology? Do you have any special configs enabled?

Thanks,

Danijel

On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We are able to reproduce this issue with 0.9.2 as well. We have a two-worker setup running the Trident topology.

When we kill one of the workers and the killed worker respawns on the same port (same slot), that worker is not able to communicate with the second worker. Only the transaction attempts keep increasing continuously.

But if the killed worker respawns on a new slot (a new communication port), everything works fine. This is the same behavior as in Storm 0.9.0.1.

Please update me if there is any new development.

Regards
Tarkeshwar

On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Bobby,

Just an update on the stuck Trident transactional topology issue -- I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't reproduce the bug anymore. Will keep you posted if any issues arise.

Regards,

Danijel

On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

I have not seen this before; if you could file a JIRA on this, that would be great.

- Bobby

From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
Date: Wednesday, June 4, 2014 at 10:30 AM
To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>, "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
Subject: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Hi all,

I've managed to reproduce the stuck-topology problem, and it seems to be due to the Netty transport. I'm running with the ZMQ transport enabled now and haven't been able to reproduce it.

The problem is basically a Trident/Kafka transactional topology getting stuck, i.e. re-emitting the same batches over and over again. This happens after the Storm workers restart a few times due to the Kafka spout throwing RuntimeExceptions (because the Kafka consumer in the spout times out with a SocketTimeoutException due to some temporary network problems). Sometimes the topology gets stuck after just one worker restart, and sometimes a few worker restarts are needed to trigger the problem.

I simulated the Kafka spout socket timeouts by blocking network access from Storm to my Kafka machines (with an iptables firewall rule). Most of the time the spouts (workers) would restart normally (after re-enabling access to Kafka) and the topology would continue to process batches, but sometimes the topology would get stuck re-emitting batches after the crashed workers restarted. Killing and re-submitting the topology manually always fixes this, and processing then continues normally.

I haven't been able to reproduce this scenario after reverting my Storm cluster's transport to ZeroMQ. With the Netty transport, I can almost always reproduce the problem by causing a worker to restart a number of times (only about 4-5 worker restarts are enough to trigger it).

Any hints on this? Has anyone had the same problem? It does seem a serious issue, as it affects the reliability and fault tolerance of the Storm cluster.

In the meantime, I'll try to prepare a reproducible test case for this.

Thanks,

Danijel

On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

To (partially) answer my own question -- I still have no idea about the cause of the stuck topology, but re-submitting the topology helps -- after re-submitting, my topology is now running normally.
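For context, the shape of the topology being described (a transactional Kafka spout feeding a persistentAggregate) is roughly the sketch below. The ZooKeeper connect string and topic name are placeholders, MemoryMapState stands in for the trident-mssql state used in the real topology, and the spout classes are the storm-kafka-0.8-plus ones mentioned further down the thread:

    import backtype.storm.generated.StormTopology;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.tuple.Fields;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;
    import storm.kafka.trident.TransactionalTridentKafkaSpout;
    import storm.kafka.trident.TridentKafkaConfig;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.MemoryMapState;

    public class KafkaTridentSketch {
        public static StormTopology build() {
            // Placeholder ZooKeeper ensemble and Kafka topic.
            TridentKafkaConfig kafkaConfig =
                    new TridentKafkaConfig(new ZkHosts("zkhost:2181"), "my-topic");
            kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme()); // emits a "str" field

            TridentTopology topology = new TridentTopology();
            topology.newStream("kafka-spout", new TransactionalTridentKafkaSpout(kafkaConfig))
                    .groupBy(new Fields("str"))
                    // MemoryMapState stands in for the transactional SQL-backed state here.
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"))
                    .parallelismHint(2);
            return topology.build();
        }
    }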
On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the BackingMap implementation as my strategy for rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).

From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:

    2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736

This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a case of a batch retrying indefinitely. This is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.

I can see in my IBackingMap's backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap, since there are two BackingMap instances running (parallelismHint 2).

However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.

Any suggestions on the area (topology component) to focus my research on?

Thanks,
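For illustration, the "throw a RuntimeException from the IBackingMap on a SQL deadlock" strategy described above looks roughly like the sketch below. The class name and the JDBC helpers are invented placeholders; only the rethrow-to-fail-the-batch pattern reflects what the mail describes:

    import storm.trident.state.map.IBackingMap;

    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class DeadlockAwareBackingMap implements IBackingMap<Long> {

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            List<Long> vals = new ArrayList<Long>(keys.size());
            for (List<Object> key : keys) {
                vals.add(readValue(key)); // placeholder JDBC lookup; may return null
            }
            return vals;
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            try {
                writeBatch(keys, vals); // placeholder JDBC upsert of the whole batch
            } catch (SQLException e) {
                // On a deadlock (or any SQL failure), rethrow unchecked: the worker dies,
                // Storm restarts it, and Trident fails and retries the batch.
                throw new RuntimeException("SQL write failed, failing batch for retry", e);
            }
        }

        private Long readValue(List<Object> key) {
            return null; // real code would query the database here
        }

        private void writeBatch(List<List<Object>> keys, List<Long> vals) throws SQLException {
            // real code would run the inserts/updates in a single transaction here
        }
    }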
On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hello,

I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted and no tuples being persisted by the TridentState (IBackingMap).

It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.

In ZooKeeper I can see Storm is re-trying a batch, i.e.

    "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"

... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck, as the attempt count in ZooKeeper keeps increasing -- the batch doesn't seem to be getting acked by Trident, and I have no idea why, especially since the topology had been running successfully for the last 20 days.

I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.

Any hints on how to debug the stuck topology? Any other useful info I might provide?

Thanks,

--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385 98 9035562
Skype: danijel.schiavuzzi