The value for topology.max.spout.pending is currently 1000. I did decrease it previously to understand the effect of that value on my problem. Throughput clearly dropped, but there is still a very high rate of failure!
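For reference, topology.max.spout.pending caps how many un-acked tuples each spout task may have in flight, so it throttles the spout whenever acking stalls downstream. A minimal sketch of setting it through Storm's Java Config API (the topology name and the elided wiring are hypothetical; in a packaged Metron install these values usually come from the topology's configuration files rather than hand-written code):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class SubmitWithSpoutPending {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... setSpout(...) / setBolt(...) wiring elided ...

            Config conf = new Config();
            // At most 1000 un-acked tuples in flight per spout task.
            conf.setMaxSpoutPending(1000);
            // A tuple tree must be fully acked within this window or it
            // is failed and replayed (the 30 s -> 300 s change discussed
            // later in the thread).
            conf.setMessageTimeoutSecs(300);

            StormSubmitter.submitTopology("enrichment", conf,
                    builder.createTopology());
        }
    }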
On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <ceste...@gmail.com> wrote:
> Ok, so ignoring the indexing topology, the fact that you're seeing
> failures in the enrichment topology, which has no ES component, is
> telling. It's also telling that the enrichment topology stats are
> perfectly sensible latency-wise (i.e. it's not sweating).
>
> What's your Storm configuration for topology.max.spout.pending? If it's
> not set, then try setting it to 1000 and bouncing the topologies.

On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <alinazem...@gmail.com> wrote:
> No, nothing ...

On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <ceste...@gmail.com> wrote:
> Anything going on in the Kafka broker logs?

On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Although this is a test platform with a much lower spec than
> production, it should be enough for indexing 600 docs per second. I
> have seen benchmark results of 150-200k docs per second with this
> spec! I haven't played with tuning the template yet, but I still think
> the current rate does not make sense at all.
>
> I have changed the batch size to 100. Throughput has dropped, but
> there is still a very high rate of failure!
>
> Please find the screenshots for the enrichments:
> http://imgur.com/a/ceC8f
> http://imgur.com/a/sBQwM

On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <ceste...@gmail.com> wrote:
> Ok, yeah, those latencies are pretty high. I think what's happening is
> that the tuples aren't being acked fast enough and are timing out. How
> taxed is your ES box? Can you drop the batch size down to maybe 100
> and see what happens?

On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Please find the bolt section of the Storm UI for the indexing
> topology:
>
> http://imgur.com/a/tFkmO
>
> As you can see, an HDFS error has also appeared, which is not
> important right now.

On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <ceste...@gmail.com> wrote:
> What's curious is the enrichment topology showing the same issues, but
> my mind went to ES as well.

On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <merrim...@gmail.com> wrote:
> Yes, which bolt is reporting all those failures? My theory is that
> there is some ES tuning that needs to be done.

On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <ceste...@gmail.com> wrote:
> Could I see a little more of that screen? Specifically what the bolts
> look like.

On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Please find the Storm UI screenshot as follows.
>
> http://imgur.com/FhIrGFd

On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Hi Casey,
>
> - topology.message.timeout: It was 30s at first. I have increased it
>   to 300s; no change!
> - It is a very basic geo-enrichment and a simple rule for threat
>   triage!
> - No, not at all.
> - I have changed that to find the best value. It is 5000, which is
>   about 5 MB.
> - I have changed the number of executors for the Storm acker thread,
>   and I have also changed the value of topology.max.spout.pending;
>   still no change!
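A back-of-envelope check on the numbers in those answers is instructive (a sketch, not Metron code; the ~600 docs/sec rate is quoted elsewhere in the thread, and the flush-on-full-batch model is an assumption): a writer that only flushes once batchSize documents accumulate makes the oldest tuple in a batch wait for the whole batch to fill before its ack can even be sent, and if batchSize exceeds topology.max.spout.pending the batch may never fill at all, because the spout stops emitting while the writer is still waiting.

    // Rough model of writer batching vs. tuple timeouts; the numbers are
    // taken from this thread, the model itself is an assumption.
    public class BatchTimeoutSketch {
        public static void main(String[] args) {
            double docsPerSec = 600.0;     // observed indexing rate
            int batchSize = 5000;          // ES writer batch size (~5 MB)
            int maxSpoutPending = 1000;    // un-acked tuples the spout allows
            int messageTimeoutSecs = 30;   // original tuple timeout

            // Time just to fill one batch, ignoring ES latency entirely.
            double fillSecs = batchSize / docsPerSec;   // ~8.3 s
            System.out.printf("batch fill time ~ %.1f s (tuple timeout %d s)%n",
                    fillSecs, messageTimeoutSecs);

            // Starvation case: the spout caps in-flight tuples below the
            // batch size, so a flush-on-full-batch writer never flushes
            // and every pending tuple times out and is replayed.
            if (batchSize > maxSpoutPending) {
                System.out.println("batchSize > max.spout.pending: batches "
                        + "may never fill; expect mass timeouts");
            }
        }
    }

If this model holds, it would also line up with the symptom that messages still reach Elasticsearch: replayed tuples eventually get flushed, but the originals have long since been marked failed.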
On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <ceste...@gmail.com> wrote:
> Also,
> * What's your setting for topology.message.timeout?
> * You said you're seeing this in indexing and enrichment; what
>   enrichments do you have in place?
> * Is ES being taxed heavily?
> * What's your ES batch size for the sensor?

On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <ceste...@gmail.com> wrote:
> So you're seeing failures in the Storm topology but no errors in the
> logs. Would you mind sending over a screenshot of the indexing
> topology from the Storm UI? You might not be able to paste the image
> on the mailing list, so maybe an imgur link would be in order.
>
> Thanks,
>
> Casey

On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Hi Ryan,
>
> No, I cannot see any errors inside the indexing error topic. Also,
> the number of tuples emitted and transferred to the error indexing
> bolt is zero!

On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <merrim...@gmail.com> wrote:
> Do you see any errors in the error* index in Elasticsearch? There are
> several catch blocks across the different topologies that transform
> errors into JSON objects and forward them on to the indexing
> topology. If you're not seeing anything in the worker logs, it's
> likely the errors were captured there instead.
>
> Ryan

On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> No, everything is fine at the log level. Also, when I checked
> resource consumption on the workers, there were still plenty of
> resources available!

On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <ceste...@gmail.com> wrote:
> Seeing anything in the Storm logs for the workers?
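Ryan's error-index suggestion above is quick to verify, either with curl (e.g. curl http://localhost:9200/error*/_count) or a few lines of client code; a minimal sketch using Elasticsearch's REST _count API (host, port, and the error* index pattern are assumptions about this cluster and Metron version):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ErrorIndexCount {
        public static void main(String[] args) throws Exception {
            // Host, port, and index pattern are placeholders; adjust
            // for your cluster and error-index naming convention.
            URL url = new URL("http://localhost:9200/error*/_count");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // e.g. {"count":0,...}
                }
            }
        }
    }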
On Fri, Apr 21, 2017 at 07:41, Ali Nazemian <alinazem...@gmail.com> wrote:
> Hi all,
>
> After trying to tune Metron's performance, I have noticed that the
> failure rate for the indexing/enrichment topologies is very high
> (about 95%). However, I can see the messages in Elasticsearch. I have
> tried increasing the timeout value for the acknowledgement; it didn't
> fix the problem. I can set the number of acker executors to 0 to
> temporarily fix the problem, which is not a good idea at all. Do you
> have any idea what has caused this issue? The percentage of failures
> decreases when I reduce parallelism, but even without any parallelism
> it is still high!
>
> Cheers,
> Ali
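On the acker workaround mentioned above: with zero acker executors, Storm never tracks tuple trees, so tuples are treated as acked the moment the spout emits them. The failure count drops to zero because nothing can fail, not because anything is fixed; delivery silently becomes at-most-once. A minimal sketch (the Config call is standard Storm; the rest of the topology is elided):

    import org.apache.storm.Config;

    public class AckerTradeoff {
        public static void main(String[] args) {
            Config conf = new Config();
            // 0 ackers: no tuple tracking, no replays, and no "failed"
            // counts in the Storm UI -- but also no delivery guarantee.
            // This masks the problem rather than fixing it, exactly as
            // noted in the question above.
            conf.setNumAckers(0);
        }
    }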