Re: High (Max) CPU

Udo Kohlmeyer Fri, 02 Nov 2018 11:32:19 -0700

Hi there Rob,

Great to see that you found one of the problems. Essentially, you told the async queue to try to send whatever it had in its queue every 0 ms, which means it will just spin, wanting to send.

`batch-time-interval` and `batch-size` determine WHEN message batches get sent. A `batch-size=1` would likewise cause the queue to fire whenever there is 1 message in it. I usually treat the settings as: "send a batch of xxx messages when the limit is reached, otherwise wait yy millis and send what is available."
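
A minimal sketch of that rule in Java, purely as a toy model: the class and method names are hypothetical and this is not Geode's dispatcher code. It assumes a batch goes out once the queue holds `batch-size` entries, or once `batch-time-interval` milliseconds have passed since the last send and something is waiting.

```java
/** Toy model of the "send when full, otherwise after a timeout" rule. Not Geode code. */
public class BatchPolicySketch {

    /**
     * Decides whether a batch should be dispatched now.
     *
     * @param queued                  number of entries currently waiting
     * @param millisSinceLastSend     time elapsed since the previous dispatch
     * @param batchSize               assumed meaning of batch-size
     * @param batchTimeIntervalMillis assumed meaning of batch-time-interval
     */
    public static boolean shouldSend(int queued, long millisSinceLastSend,
                                     int batchSize, long batchTimeIntervalMillis) {
        if (queued <= 0) {
            return false; // nothing to send
        }
        boolean batchFull = queued >= batchSize;
        boolean intervalElapsed = millisSinceLastSend >= batchTimeIntervalMillis;
        return batchFull || intervalElapsed;
    }

    public static void main(String[] args) {
        // batch-size=100, batch-time-interval=1000: 5 entries after 10 ms keep waiting
        System.out.println(shouldSend(5, 10, 100, 1000));
        // batch-size=1, batch-time-interval=0: a single entry is dispatched immediately
        System.out.println(shouldSend(1, 0, 1, 0));
    }
}
```

In this model, either `batch-size=1` or `batch-time-interval=0` makes the condition true as soon as a single entry is queued, so nothing is ever really batched.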

Usually WAN Replication and Async-Queue (which share a common hierarchy) are async, fault-tolerant, batch-oriented mechanisms. Using them at the granular level of every 0 ms or 1 entry per batch incurs a lot of overhead in terms of the effort to keep the primary and backup queues in sync.

Also, keep in mind that sending every message you receive might not be beneficial either (think financial rates). If one receives data at a high frequency, it is always best to ask oneself whether the process/downstream system can process, and potentially respond to, a change BEFORE the next rate is delivered. So in some cases it makes sense to even have batch-conflation turned on, to avoid sending messages that could essentially be ignored.
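
To illustrate what conflation buys you, here is a rough Java sketch. The names are hypothetical and this is not Geode's internal logic; it simply assumes conflation means keeping only the most recent value per key among the updates waiting in a batch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of conflation: later updates for a key replace earlier ones. Not Geode code. */
public class ConflationSketch {

    /**
     * Collapses a sequence of (key, value) updates to the latest value per key.
     * Keys keep the order in which they were first seen.
     */
    public static Map<String, Double> conflate(List<Map.Entry<String, Double>> updates) {
        Map<String, Double> latest = new LinkedHashMap<>();
        for (Map.Entry<String, Double> update : updates) {
            latest.put(update.getKey(), update.getValue()); // overwrite any older value
        }
        return latest;
    }

    public static void main(String[] args) {
        // Made-up rate updates: two for EUR/USD, one for GBP/USD
        List<Map.Entry<String, Double>> updates = List.of(
                Map.entry("EUR/USD", 1.10),
                Map.entry("GBP/USD", 1.30),
                Map.entry("EUR/USD", 1.11));
        System.out.println(conflate(updates));
    }
}
```

With the made-up rates above, the downstream system would see one EUR/USD update instead of two, which is the point when intermediate values would be stale by the time they are processed.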

But if the requirement is to send (and react to) every message, then these two parameters are what I would test with to find the optimal send size and timeout. ALSO, take into account that network send buffer sizes can play a role in performance, so consider them when sizing batches and configuring buffer sizes.


--Udo


On 11/2/18 09:56, Rob Shepherd wrote:

Thank you Nabarun,
Having started to pull out my config to send to you, I noticed the following in my cache.xml:
<async-event-queue
id="expiry-event-queue"
parallel="false"
enable-batch-conflation="false"
disk-synchronous="true"
forward-expiration-destroy="true"
  ---> batch-time-interval="0"
batch-size="1"
>...
…Which is a copy from the example here: https://github.com/apache/geode-examples/blob/master/async/scripts/start.gfsh
create async-event-queue --id=example-async-event-queue \
   --parallel=false \
   --enable-batch-conflation=false \
   --batch-size=1 \
   --batch-time-interval=0 \
   --listener=org.apache.geode_examples.async.ExampleAsyncEventListener
I pondered on "batch-time-interval" and set it to 1000, and it has fixed the issue.
I think I understand what this parameter is for, so a delay here would be tolerable.
I have another question: if I set parallel="true", the gfsh start server command hangs and I have to kill the new server process and the gfsh launcher.
It is not important to me now, but I would like to evaluate this at some point, so I’ll happily try to debug the cause.
Thanks

Rob
On 2 Nov 2018, at 16:22, Nabarun Nag wrote:
Hi Rob,
We will look into this. Meanwhile, could you please elaborate on the configuration Apache Geode is running with: how many servers, how many AEQs, regions, etc., and what workload it is running?
Thank you
Nabarun Nag
On Nov 2, 2018, at 8:37 AM, Rob Shepherd wrote:
Hi,

I’m using Geode (1.7.0) locally (OSX) and on a server (Linux Arm64).

On both I’m seeing maxed out CPUs.
I’ve profiled it locally on a dormant server instance (no application activity) and the Async Queue routines are the highest contributor to CPU activity by a long stretch.
<PastedGraphic-1.png>
Total Time [%]  Total Time [µs]  Total Time (CPU)  Samples  Back Traces - Method
100.00%         2829377292       2829377292        8639     org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.getHeadKey()
0.00%           0                0                 8639     .org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.getCurrentKey()
0.00%           0                0                 8263     ..org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.peekAhead()
1.21%           9906300          9906300           8180     ...org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.peek()
100.00%         1144915695       1144915695        5        ....org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue()
0.00%           0                0                 5        .....org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run()
How can I determine if this is a problem with my setup or if it is a bug?
A supposition: I notice that there are multiple instances of a thread named after my Async Event queue ID:
Event Processor for GatewaySender_AsyncEventQueue_expiry-event-queue.1
Event Processor for GatewaySender_AsyncEventQueue_expiry-event-queue.2
Event Processor for GatewaySender_AsyncEventQueue_expiry-event-queue.3
Event Processor for GatewaySender_AsyncEventQueue_expiry-event-queue.4
Are there supposed to be 4? Are they interfering with each other (wait/notify) on an empty queue?
Thanks for any insight

Rob
