The value for topology.max.spout.pending is currently 1000. I did decrease it previously to understand the effect of that value on my problem. Throughput clearly dropped, but there is still a very high rate of failure!
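For reference, topology.max.spout.pending caps how many un-acked tuples each spout task may have in flight, so it throttles the spout whenever acking stalls downstream. A minimal sketch of setting it through Storm's Java Config API (the topology name and the elided wiring are hypothetical; in a packaged Metron install these values usually come from the topology's configuration files rather than hand-written code):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class SubmitWithSpoutPending {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... setSpout(...) / setBolt(...) wiring elided ...

            Config conf = new Config();
            // At most 1000 un-acked tuples in flight per spout task.
            conf.setMaxSpoutPending(1000);
            // A tuple tree must be fully acked within this window or it
            // is failed and replayed (the 30 s -> 300 s change discussed
            // later in the thread).
            conf.setMessageTimeoutSecs(300);

            StormSubmitter.submitTopology("enrichment", conf,
                    builder.createTopology());
        }
    }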
On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <ceste...@gmail.com> wrote:
> Ok, so ignoring the indexing topology, the fact that you're seeing
> failures in the enrichment topology, which has no ES component, is
> telling. It's also telling that the enrichment topology stats are
> perfectly sensible latency-wise (i.e. it's not sweating).
>
> What's your Storm configuration for topology.max.spout.pending? If it's
> not set, then try setting it to 1000 and bouncing the topologies.

On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <alinazem...@gmail.com> wrote:
> No, nothing ...

On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <ceste...@gmail.com> wrote:
> Anything going on in the Kafka broker logs?

On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Although this is a test platform with a much lower spec than
> production, it should be enough for indexing 600 docs per second. I
> have seen benchmark results of 150-200k docs per second with this
> spec! I haven't played with tuning the template yet, but I still think
> the current rate does not make sense at all.
>
> I have changed the batch size to 100. Throughput has dropped, but
> there is still a very high rate of failure!
>
> Please find the screenshots for the enrichments:
> http://imgur.com/a/ceC8f
> http://imgur.com/a/sBQwM

On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <ceste...@gmail.com> wrote:
> Ok, yeah, those latencies are pretty high. I think what's happening is
> that the tuples aren't being acked fast enough and are timing out. How
> taxed is your ES box? Can you drop the batch size down to maybe 100
> and see what happens?

On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Please find the bolt section of the Storm UI for the indexing
> topology:
>
> http://imgur.com/a/tFkmO
>
> As you can see, an HDFS error has also appeared, which is not
> important right now.

On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <ceste...@gmail.com> wrote:
> What's curious is the enrichment topology showing the same issues, but
> my mind went to ES as well.

On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <merrim...@gmail.com> wrote:
> Yes, which bolt is reporting all those failures? My theory is that
> there is some ES tuning that needs to be done.

On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <ceste...@gmail.com> wrote:
> Could I see a little more of that screen? Specifically what the bolts
> look like.

On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Please find the Storm UI screenshot as follows.
>
> http://imgur.com/FhIrGFd

On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Hi Casey,
>
> - topology.message.timeout: It was 30s at first. I have increased it
>   to 300s; no change!
> - It is a very basic geo-enrichment and a simple rule for threat
>   triage!
> - No, not at all.
> - I have changed that to find the best value. It is 5000, which is
>   about 5 MB.
> - I have changed the number of executors for the Storm acker thread,
>   and I have also changed the value of topology.max.spout.pending;
>   still no change!
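A back-of-envelope check on the numbers in those answers is instructive (a sketch, not Metron code; the ~600 docs/sec rate is quoted elsewhere in the thread, and the flush-on-full-batch model is an assumption): a writer that only flushes once batchSize documents accumulate makes the oldest tuple in a batch wait for the whole batch to fill before its ack can even be sent, and if batchSize exceeds topology.max.spout.pending the batch may never fill at all, because the spout stops emitting while the writer is still waiting.

    // Rough model of writer batching vs. tuple timeouts; the numbers are
    // taken from this thread, the model itself is an assumption.
    public class BatchTimeoutSketch {
        public static void main(String[] args) {
            double docsPerSec = 600.0;     // observed indexing rate
            int batchSize = 5000;          // ES writer batch size (~5 MB)
            int maxSpoutPending = 1000;    // un-acked tuples the spout allows
            int messageTimeoutSecs = 30;   // original tuple timeout

            // Time just to fill one batch, ignoring ES latency entirely.
            double fillSecs = batchSize / docsPerSec;   // ~8.3 s
            System.out.printf("batch fill time ~ %.1f s (tuple timeout %d s)%n",
                    fillSecs, messageTimeoutSecs);

            // Starvation case: the spout caps in-flight tuples below the
            // batch size, so a flush-on-full-batch writer never flushes
            // and every pending tuple times out and is replayed.
            if (batchSize > maxSpoutPending) {
                System.out.println("batchSize > max.spout.pending: batches "
                        + "may never fill; expect mass timeouts");
            }
        }
    }

If this model holds, it would also line up with the symptom that messages still reach Elasticsearch: replayed tuples eventually get flushed, but the originals have long since been marked failed.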
On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <ceste...@gmail.com> wrote:
> Also,
> * What's your setting for topology.message.timeout?
> * You said you're seeing this in indexing and enrichment; what
>   enrichments do you have in place?
> * Is ES being taxed heavily?
> * What's your ES batch size for the sensor?

On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <ceste...@gmail.com> wrote:
> So you're seeing failures in the Storm topology but no errors in the
> logs. Would you mind sending over a screenshot of the indexing
> topology from the Storm UI? You might not be able to paste the image
> on the mailing list, so maybe an imgur link would be in order.
>
> Thanks,
>
> Casey

On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> Hi Ryan,
>
> No, I cannot see any errors inside the indexing error topic. Also,
> the number of tuples emitted and transferred to the error indexing
> bolt is zero!

On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <merrim...@gmail.com> wrote:
> Do you see any errors in the error* index in Elasticsearch? There are
> several catch blocks across the different topologies that transform
> errors into JSON objects and forward them on to the indexing
> topology. If you're not seeing anything in the worker logs, it's
> likely the errors were captured there instead.
>
> Ryan

On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> No, everything is fine at the log level. Also, when I checked
> resource consumption on the workers, there were still plenty of
> resources available!

On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <ceste...@gmail.com> wrote:
> Seeing anything in the Storm logs for the workers?
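Ryan's error-index suggestion above is quick to verify, either with curl (e.g. curl http://localhost:9200/error*/_count) or a few lines of client code; a minimal sketch using Elasticsearch's REST _count API (host, port, and the error* index pattern are assumptions about this cluster and Metron version):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ErrorIndexCount {
        public static void main(String[] args) throws Exception {
            // Host, port, and index pattern are placeholders; adjust
            // for your cluster and error-index naming convention.
            URL url = new URL("http://localhost:9200/error*/_count");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // e.g. {"count":0,...}
                }
            }
        }
    }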
On Fri, Apr 21, 2017 at 07:41, Ali Nazemian <alinazem...@gmail.com> wrote:
> Hi all,
>
> After trying to tune Metron's performance, I have noticed that the
> failure rate for the indexing/enrichment topologies is very high
> (about 95%). However, I can see the messages in Elasticsearch. I have
> tried increasing the timeout value for the acknowledgement; it didn't
> fix the problem. I can set the number of acker executors to 0 to
> temporarily fix the problem, which is not a good idea at all. Do you
> have any idea what has caused this issue? The percentage of failures
> decreases when I reduce parallelism, but even without any parallelism
> it is still high!
>
> Cheers,
> Ali
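On the acker workaround mentioned above: with zero acker executors, Storm never tracks tuple trees, so tuples are treated as acked the moment the spout emits them. The failure count drops to zero because nothing can fail, not because anything is fixed; delivery silently becomes at-most-once. A minimal sketch (the Config call is standard Storm; the rest of the topology is elided):

    import org.apache.storm.Config;

    public class AckerTradeoff {
        public static void main(String[] args) {
            Config conf = new Config();
            // 0 ackers: no tuple tracking, no replays, and no "failed"
            // counts in the Storm UI -- but also no delivery guarantee.
            // This masks the problem rather than fixing it, exactly as
            // noted in the question above.
            conf.setNumAckers(0);
        }
    }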