Thanks for the reply. Can I pull this fix, or can I download it?

On 17 Sep 2014 09:58, tarkeshwa...@gmail.com wrote:
In which version is it available?

On 16 Sep 2014 19:01, "Danijel Schiavuzzi" <dani...@schiavuzzi.com> wrote:

Yes, it's been fixed in 'master' for some time now.

Danijel

On Tuesday, September 16, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

Is the issue resolved in any version of Storm?

Regards
Tarkeshwar

On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I've filed a bug report for this under https://issues.apache.org/jira/browse/STORM-406

The issue is 100% reproducible with, it seems, any Trident topology, and across multiple Storm versions with the Netty transport enabled. 0MQ works fine. You can try with TridentWordCount from storm-starter, for example.

Your insight seems correct: when the killed worker re-spawns on the same slot (port), the topology stops processing. See the above JIRA for additional info.

Danijel

On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Thanks, Danijel, for helping me.

On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I see no issues with your cluster configuration.

You should definitely share the (simplified, if possible) topology code and the steps to reproduce the blockage; better yet, file a JIRA task on Apache's JIRA -- be sure to include your Trident internals modifications.

Unfortunately, it seems I'm having the same issues with Storm 0.9.2 too, so I might get back here with some updates soon. It's not as quickly and easily reproducible as it was under 0.9.1, but the bug nonetheless seems to still be present. I'll reduce the number of Storm slots and topology workers as per your insights; hopefully that will make it easier to reproduce the bug with a simplified Trident topology.

On Tuesday, July 15, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We have made a few changes to the Trident core framework code as per our needs, and these work fine with ZeroMQ. I am sharing the configuration we are using. Can you please check whether our config is fine?

The code is too large to share as is, so we are writing a sample topology to reproduce the issue, which we will share with you (a minimal sketch of such a topology follows the steps below).

Steps to reproduce the issue:

1. We deployed our topology on one Linux machine, with two workers, one acker, and a batch size of 2.
2. Both workers come up and start processing.
3. After a few seconds, I killed one of the workers with kill -9.
4. When the killed worker respawns on the same port, it hangs.
5. Only retries keep occurring.
6. When the killed worker respawns on another port, everything works fine.
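For reference, a minimal sample Trident topology along the lines of storm-starter's TridentWordCount is sketched below. This is only an illustrative stand-in for the sample topology mentioned above (the class name, data, and the two-worker setting are assumptions against the Storm 0.9.x API), not the actual topology code:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class TridentReproTopology {

        // Splits a sentence into words, emitting one tuple per word.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // A cycling in-memory spout stands in for the real data source.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"))
                    .parallelismHint(2);

            Config conf = new Config();
            conf.setNumWorkers(2); // two workers, as in the reproduction steps above
            StormSubmitter.submitTopology("trident-repro", conf, topology.build());
        }
    }

Submitting something like this to the two-worker cluster and then killing one worker with kill -9 (step 3 above) should exercise the same respawn-on-the-same-slot path.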
machine conf:

    [root@sb6270x1637-2 conf]# uname -a
    Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

storm.yaml which we are using to launch nimbus, supervisor and ui:

    ########## These MUST be filled in for a storm configuration
    storm.zookeeper.servers:
        - "10.61.244.86"
    storm.zookeeper.port: 2000
    supervisor.slots.ports:
        - 6788
        - 6789
        - 6800
        - 6801
        - 6802
        - 6803

    nimbus.host: "10.61.244.86"

    storm.messaging.transport: "backtype.storm.messaging.netty.Context"
    storm.messaging.netty.server_worker_threads: 10
    storm.messaging.netty.client_worker_threads: 10
    storm.messaging.netty.buffer_size: 5242880
    storm.messaging.netty.max_retries: 100
    storm.messaging.netty.max_wait_ms: 1000
    storm.messaging.netty.min_wait_ms: 100

    storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
    storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"

    topology.acker.executors: 1
    topology.message.timeout.secs: 30

    supervisor.scheduler.meta:
        name: "supervisor1"

    worker.childopts: "-Xmx2048m"

    mm.hdfs.ipaddress: "10.61.244.7"
    mm.hdfs.port: 9000
    topology.batch.size: 2
    topology.batch.timeout: 10000
    topology.workers: 2
    topology.debug: true

Regards
Tarkeshwar
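As a side note, the topology-level values in the storm.yaml above can also be set programmatically at submit time. A small sketch against the standard backtype.storm.Config API (the mm.* and topology.batch.* keys are application-specific, so they are only passed through as plain map entries):

    import backtype.storm.Config;

    public class ReproConfig {
        // Builds the same topology-level settings as in the storm.yaml above.
        public static Config build() {
            Config conf = new Config();
            conf.setNumWorkers(2);           // topology.workers: 2
            conf.setNumAckers(1);            // topology.acker.executors: 1
            conf.setMessageTimeoutSecs(30);  // topology.message.timeout.secs: 30
            conf.setDebug(true);             // topology.debug: true
            // Application-specific keys are plain map entries (Config extends HashMap):
            conf.put("topology.batch.size", 2);
            conf.put("topology.batch.timeout", 10000);
            return conf;
        }
    }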
On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Tarkeshwar,

Could you provide a code sample of your topology? Do you have any special configs enabled?

Thanks,

Danijel

On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We are able to reproduce this issue with 0.9.2 as well. We have a two-worker setup running the Trident topology.

When we kill one of the workers and the killed worker respawns on the same port (same slot), that worker is not able to communicate with the second worker. Only the transaction attempts keep increasing continuously.

But if the killed worker respawns on a new slot (a new communication port), everything works fine. This is the same behavior as in Storm 0.9.0.1.

Please update me if there is any new development.

Regards
Tarkeshwar

On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Bobby,

Just an update on the stuck Trident transactional topology issue -- I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't reproduce the bug anymore. Will keep you posted if any issues arise.

Regards,

Danijel

On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

I have not seen this before; if you could file a JIRA on this, that would be great.

- Bobby

From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
Date: Wednesday, June 4, 2014 at 10:30 AM
To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>, "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
Subject: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Hi all,

I've managed to reproduce the stuck-topology problem, and it seems to be due to the Netty transport. I'm running with the ZMQ transport enabled now and haven't been able to reproduce it.

The problem is basically a Trident/Kafka transactional topology getting stuck, i.e. re-emitting the same batches over and over again. This happens after the Storm workers restart a few times due to the Kafka spout throwing RuntimeExceptions (because the Kafka consumer in the spout times out with a SocketTimeoutException due to some temporary network problems). Sometimes the topology gets stuck after just one worker restart, and sometimes a few worker restarts are needed to trigger the problem.

I simulated the Kafka spout socket timeouts by blocking network access from Storm to my Kafka machines (with an iptables firewall rule). Most of the time the spouts (workers) would restart normally (after re-enabling access to Kafka) and the topology would continue to process batches, but sometimes the topology would get stuck re-emitting batches after the crashed workers restarted. Killing and re-submitting the topology manually always fixes this, and processing then continues normally.

I haven't been able to reproduce this scenario after reverting my Storm cluster's transport to ZeroMQ. With the Netty transport, I can almost always reproduce the problem by causing a worker to restart a number of times (only about 4-5 worker restarts are enough to trigger it).

Any hints on this? Has anyone had the same problem? It does seem a serious issue, as it affects the reliability and fault tolerance of the Storm cluster.

In the meantime, I'll try to prepare a reproducible test case for this.

Thanks,

Danijel

On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

To (partially) answer my own question -- I still have no idea about the cause of the stuck topology, but re-submitting the topology helps -- after re-submitting, my topology is now running normally.
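For context, the shape of the topology being described (a transactional Kafka spout feeding a persistentAggregate) is roughly the sketch below. The ZooKeeper connect string and topic name are placeholders, MemoryMapState stands in for the trident-mssql state used in the real topology, and the spout classes are the storm-kafka-0.8-plus ones mentioned further down the thread:

    import backtype.storm.generated.StormTopology;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.tuple.Fields;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;
    import storm.kafka.trident.TransactionalTridentKafkaSpout;
    import storm.kafka.trident.TridentKafkaConfig;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.MemoryMapState;

    public class KafkaTridentSketch {
        public static StormTopology build() {
            // Placeholder ZooKeeper ensemble and Kafka topic.
            TridentKafkaConfig kafkaConfig =
                    new TridentKafkaConfig(new ZkHosts("zkhost:2181"), "my-topic");
            kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme()); // emits a "str" field

            TridentTopology topology = new TridentTopology();
            topology.newStream("kafka-spout", new TransactionalTridentKafkaSpout(kafkaConfig))
                    .groupBy(new Fields("str"))
                    // MemoryMapState stands in for the transactional SQL-backed state here.
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"))
                    .parallelismHint(2);
            return topology.build();
        }
    }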
On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the BackingMap implementation as my strategy for rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).

From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:

    2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736

This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a case of a batch retrying indefinitely. This is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.

I can see in my IBackingMap's backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap, since there are two BackingMap instances running (parallelismHint 2).

However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.

Any suggestions on the area (topology component) to focus my research on?

Thanks,
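For illustration, the "throw a RuntimeException from the IBackingMap on a SQL deadlock" strategy described above looks roughly like the sketch below. The class name and the JDBC helpers are invented placeholders; only the rethrow-to-fail-the-batch pattern reflects what the mail describes:

    import storm.trident.state.map.IBackingMap;

    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class DeadlockAwareBackingMap implements IBackingMap<Long> {

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            List<Long> vals = new ArrayList<Long>(keys.size());
            for (List<Object> key : keys) {
                vals.add(readValue(key)); // placeholder JDBC lookup; may return null
            }
            return vals;
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            try {
                writeBatch(keys, vals); // placeholder JDBC upsert of the whole batch
            } catch (SQLException e) {
                // On a deadlock (or any SQL failure), rethrow unchecked: the worker dies,
                // Storm restarts it, and Trident fails and retries the batch.
                throw new RuntimeException("SQL write failed, failing batch for retry", e);
            }
        }

        private Long readValue(List<Object> key) {
            return null; // real code would query the database here
        }

        private void writeBatch(List<List<Object>> keys, List<Long> vals) throws SQLException {
            // real code would run the inserts/updates in a single transaction here
        }
    }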
On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hello,

I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted and no tuples being persisted by the TridentState (IBackingMap).

It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.

In ZooKeeper I can see Storm is re-trying a batch, i.e.

    "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"

... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck, as the attempt count in ZooKeeper keeps increasing -- the batch doesn't seem to be getting acked by Trident, and I have no idea why, especially since the topology had been running successfully for the last 20 days.

I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.

Any hints on how to debug the stuck topology? Any other useful info I might provide?

Thanks,

--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385 98 9035562
Skype: danijel.schiavuzzi