Ashish,
This might not be a server problem... It could very well be a client
problem: a client that is not responding in a timely manner. Can you
also check your client's GC?
--Udo
On 10/19/18 11:46, aashish choudhary wrote:
Below is our configuration:
4 data nodes, 3 locator nodes
8 vCPUs per node
128 GB of RAM per data node, of which 64 GB is allocated to Geode. I
don't think it is caused by GC, as our heap utilization is low.
Anyway, we will check the GC logs to see whether it's related.
Thanks,
Ashish
On Fri, Oct 19, 2018, 11:32 PM Anthony Baker <[email protected]> wrote:
Look for GC pauses. You can add flags to the startup options to
capture GC behavior and determine whether you're hitting a “stop the
world” pause. How much free heap space do you have?
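A minimal sketch of checking heap headroom and cumulative GC time from
inside a JVM (for example, the client), using only the standard
java.lang.management API. The class name GcSnapshot is hypothetical.
Cumulative collection time does not show the length of individual
pauses, so GC logs remain the way to spot a single long "stop the
world" pause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not a Geode API: reports heap usage and cumulative
// GC activity for the JVM it runs in.
class GcSnapshot {

    // Percentage of the maximum heap currently in use.
    static double heapUsedPercent() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return 100.0 * used / rt.maxMemory();
    }

    // Total time (ms) all collectors report since JVM start.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if this collector does not report time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%% of max%n", heapUsedPercent());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total GC time: " + totalGcTimeMs() + " ms");
    }
}
```

To capture individual pauses in a log, JDK 8 accepts flags such as
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file>, and JDK 9+
uses -Xlog:gc*:file=<file>; which JDK this cluster runs on is not
stated in the thread.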
Here are a few links:
https://cwiki.apache.org/confluence/display/GEODE/Resource+Management+in+Geode
https://cwiki.apache.org/confluence/display/GEODE/Troubleshooting+Garbage+Collection+Pauses
Anthony
On Oct 19, 2018, at 10:17 AM, aashish choudhary
<[email protected]> wrote:
Thanks, Charlie. I watched the video and it was helpful.
Apart from overcommitted hardware, could there be any other cause
from a Geode perspective, e.g. a slow server? We ask because this
is the first time we have encountered this issue. We will certainly
look into steal/ready time.
Thanks,
Ashish
On Fri, Oct 19, 2018, 9:43 PM Charlie Black <[email protected]> wrote:
It's not normal for Geode to stop servicing requests. I
*do not* recommend changing the fault tolerances until you
find out why things aren't responding within 10 seconds to 1
minute. Imagine your users waiting a minute or more for an
in-memory system to return a value.
One thing to look out for is overcommitted hardware. You
can review steal time on the guest OS. However, most
enterprises disable host reporting, so you might have to
review Ready Time in ms for the Geode VM. This shows how
long the VM was waiting to run; if it's anything larger than
zero, make sure this is what you want.
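On a Linux guest, steal time can be read from the aggregate "cpu" line
of /proc/stat, where steal is the eighth counter. Below is a minimal
sketch assuming a kernel that reports the steal column; the class name
StealTime is hypothetical. Ready Time is a hypervisor-side metric and is
not visible from inside the guest, so this covers only the steal side.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (Linux guests only): estimates CPU steal time from two
// samples of the aggregate "cpu" line in /proc/stat.
// Field order: user nice system idle iowait irq softirq steal guest guest_nice
class StealTime {

    // Parses the first 8 counters (user..steal) of the aggregate "cpu" line.
    static long[] parseCpuLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (!parts[0].equals("cpu") || parts.length < 9) {
            throw new IllegalArgumentException("expected aggregate cpu line with steal: " + line);
        }
        long[] counters = new long[8];
        for (int i = 0; i < 8; i++) {
            counters[i] = Long.parseLong(parts[i + 1]);
        }
        return counters;
    }

    // Percentage of elapsed CPU time that was stolen between two samples.
    // guest/guest_nice are excluded because the kernel already counts them in user/nice.
    static double stealPercent(long[] before, long[] after) {
        long total = 0;
        for (int i = 0; i < 8; i++) {
            total += after[i] - before[i];
        }
        long steal = after[7] - before[7];
        return total == 0 ? 0.0 : 100.0 * steal / total;
    }

    // The aggregate "cpu" line is the first line of /proc/stat.
    static String readCpuLine() throws IOException {
        return Files.readAllLines(Paths.get("/proc/stat")).get(0);
    }

    public static void main(String[] args) throws Exception {
        long[] a = parseCpuLine(readCpuLine());
        Thread.sleep(1000);
        long[] b = parseCpuLine(readCpuLine());
        System.out.printf("steal over last second: %.2f%%%n", stealPercent(a, b));
    }
}
```

A persistently non-zero result suggests the hypervisor is scheduling
other work on this VM's physical CPUs.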
Here is a video where I talk about overcommitted hardware,
which applies to anything running in containers or VMs:
https://www.youtube.com/watch?v=0I2oPBKctgU
Regards,
Charlie
On Fri, Oct 19, 2018 at 5:32 AM aashish choudhary
<[email protected]> wrote:
Hi,
Recently, one of our Geode client applications has been
getting the exception below:
Pool unexpected socket timed out on client
Server unreachable: could not
connect after 1 attempts
After looking at various threads, we learned that we
need to set read-timeout in the client configuration to a
higher value. The default is 10 seconds, I believe. We're
curious why the server would take more than 10 seconds to
respond, as 10 seconds already seems on the high side.
For now, we will probably increase it to at least 30
seconds and observe whether it makes any difference.
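The read-timeout bounds how long a client waits for a reply on an
established connection. In Geode it is set on the client pool
(read-timeout on the pool in cache.xml, or PoolFactory.setReadTimeout /
ClientCacheFactory.setPoolReadTimeout in Java; default 10000 ms). The
sketch below is a standard-library analogy, not Geode code: a client
socket with SO_TIMEOUT reading from a server that accepts the
connection but never replies. ReadTimeoutDemo is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Stdlib analogy, not Geode code: a read timeout fires when the peer holds the
// connection open but never sends a response.
class ReadTimeoutDemo {

    // Returns true if the read timed out, false if data (or EOF) arrived in time.
    static boolean readTimesOut(int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            // Analogous to the pool's read-timeout: max wait for each read.
            client.setSoTimeout(readTimeoutMs);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks: the server never writes anything
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        boolean timedOut = readTimesOut(200);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("timed out: " + timedOut + " after ~" + elapsedMs + " ms");
    }
}
```

In this situation a larger timeout only lengthens the wait; it does not
make the silent peer respond, which is why the cause of the delay is
worth finding first.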
Also, on the server side we see the warnings below:
ClientHealthMonitor Unregistering client with member id
identity xxxx due to: Socket closed.
Monitoring client with member id identity xxxx It had
been 60534 ms since the latest heartbeat. Max interval is
60000. Terminated client.
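The second warning describes a simple rule: the server tracks when it
last heard from each client and drops a client whose silence exceeds
the maximum interval (60534 ms > 60000 ms here). The hypothetical sketch
below only illustrates that rule; it is not Geode's ClientHealthMonitor,
and the strict "greater than" comparison is an assumption. In Geode the
60000 ms limit most likely corresponds to the cache server's
maximum-time-between-pings setting (default 60000 ms), while clients
ping at the pool's ping-interval (default 10000 ms). A client frozen in
a long GC pause would stop pinging, so this warning can appear even
when the servers themselves are healthy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a heartbeat timeout rule, not Geode code.
class HeartbeatMonitorSketch {
    private final long maxIntervalMs;
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    HeartbeatMonitorSketch(long maxIntervalMs) {
        this.maxIntervalMs = maxIntervalMs;
    }

    // Record a heartbeat from a client at time nowMs.
    void heartbeat(String clientId, long nowMs) {
        lastHeartbeatMs.put(clientId, nowMs);
    }

    // Remove and return every client silent for longer than the max interval.
    // Assumption: exactly maxIntervalMs of silence is still tolerated.
    List<String> expire(long nowMs) {
        List<String> terminated = new ArrayList<>();
        lastHeartbeatMs.entrySet().removeIf(e -> {
            long sinceMs = nowMs - e.getValue();
            if (sinceMs > maxIntervalMs) {
                System.out.printf("dropping %s: %d ms since last heartbeat (max %d ms)%n",
                        e.getKey(), sinceMs, maxIntervalMs);
                terminated.add(e.getKey());
                return true;
            }
            return false;
        });
        return terminated;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(60_000);
        monitor.heartbeat("client-a", 0);
        monitor.heartbeat("client-b", 50_000);
        // client-a has been silent for 60534 ms; client-b for 10534 ms.
        System.out.println(monitor.expire(60_534));
    }
}
```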
Could this be due to high load on a particular
server? We have seen these warnings on all of our
data nodes, though.
Are there any parameters we need to tune on the server side for this?
Thanks,
Ashish
--
[email protected] | +1.858.480.9722
Principal Realtime Data Engineer