RE: Monitor the neighbour JVM using neihbour's member-timeout

2018-02-20 Thread Aravind Musigumpula
Hi Community,

Are there any comments on the proposal below?


Thanks,
Aravind Musigumpula 


-Original Message-
From: Bruce Schuchardt [mailto:bschucha...@pivotal.io] 
Sent: Thursday, January 18, 2018 11:58 PM
To: dev@geode.apache.org
Subject: Re: Monitor the neighbour JVM using neihbour's member-timeout

We don't use JGroups for membership anymore.  We rewrote all of it and now only 
use JGroups for UDP messaging.  We have complete control over the use of the 
member-timeout setting.

Aravind's idea is relevant to this group.

On 1/17/18 3:39 PM, Michael Stolz wrote:
> Pardon my ignorance, but is this something that should be brought up 
> on the JGroups community?
>
> --
> Mike Stolz
> Principal Engineer, GemFire Product Lead
> Mobile: +1-631-835-4771
> Download the new GemFire book here.
> <https://content.pivotal.io/ebooks/scaling-data-services-with-pivotal-gemfire>
>

RE: Monitor the neighbour JVM using neihbour's member-timeout

2018-01-17 Thread Aravind Musigumpula
Hi Everyone,

Consider a Geode cluster in which some nodes host a large amount of 
business-critical data, while other nodes host a smaller amount of 
non-critical data.

If both types of node go through an operation that makes them unresponsive, 
the former may take a couple of seconds longer to respond than the latter.

In this scenario, is it fair to give every member the same member-timeout?
What if we want to wait a little longer for the critical nodes?

With the present configuration in Geode we cannot wait a little longer for 
particular nodes, even though we can configure a different member-timeout on 
each node. I think no one would ever configure different timeouts per node, 
because each node's member-timeout is used to monitor its neighbours rather 
than the node itself.
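
For illustration only, per-node settings of this kind might look like the 
sketch below. It just builds the property sets and does not start a Geode 
member; `member-timeout` is the real property name (in milliseconds), while 
the class name, helper and values are hypothetical.

```java
import java.util.Properties;

public class MemberTimeoutConfig {
    // Builds the properties for one node. member-timeout is given in milliseconds.
    static Properties nodeProperties(int memberTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("member-timeout", Integer.toString(memberTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical values: a longer timeout for the large, critical nodes.
        Properties critical = nodeProperties(8000);
        Properties ordinary = nodeProperties(5000);
        System.out.println("critical member-timeout: " + critical.getProperty("member-timeout"));
        System.out.println("ordinary member-timeout: " + ordinary.getProperty("member-timeout"));
    }
}
```

With the current behaviour, the 8000 ms value would govern how the critical 
node watches its neighbour, not how long the critical node itself is 
tolerated.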

In this solution, all we do is wait for the suspected member's member-timeout, 
instead of the checking member's own timeout, during the final check.
It also has no backward-compatibility implications: anyone who wants the 
existing behavior can continue to use the same member-timeout on all nodes, 
so the behavior of the system is preserved.

If you have any concerns about this solution, please let me know.


Thanks,
Aravind Musigumpula 



RE: Monitor the neighbour JVM using neihbour's member-timeout

2017-12-18 Thread Aravind Musigumpula
Hi Community,

Could you please give your suggestions on the solution below?

I have raised a pull request for it: 
https://github.com/apache/geode/pull/1075


Thanks,
Aravind Musigumpula 


RE: Monitor the neighbour JVM using neihbour's member-timeout

2017-11-03 Thread Aravind Musigumpula
Thanks, Bruce, for the suggestions. I will move the new variables from 
InternalDistributedMember to NetView and make the changes needed for backward 
compatibility.

I now understand that there is another way a member can be removed from the 
view: if a member sends a message and waits for ack-wait-threshold without 
getting a response from the target, the sender does a final check and removes 
the target from the view if there is still no response. 
But I don't understand how deprecating member-timeout, ack-wait-threshold and 
ack-severe-alert-threshold and merging them into one setting will solve the 
problem. The main problem is that we want one member to survive in the view 
longer than the others.

Suppose we merge the settings into one and pass that setting to the monitoring 
member (say A). A will then use the timeout of the target member (say B, which 
we want to survive in the view for longer) for health monitoring, and the 
ack-wait-threshold when waiting for a response to any message before doing the 
final check.
But what if some other member (say C), which monitors a different member (say 
D), has smaller values for member-timeout and ack-wait-threshold? If C sends a 
message to B, C uses the smaller ack-wait-threshold (the one belonging to D) to 
wait for a response, and then does the final check on the basis of the smaller 
member-timeout. So B can still be kicked out of the view after a short time.

I think this can be solved simply by using the suspected member's 
member-timeout in the final check, where we establish a TCP connection. We 
don't need to combine those three settings. We can set the member-timeout of a 
particular member to a higher value; the member that monitors it uses its own 
member-timeout, as it does now, but during the final check it uses the 
suspected member's member-timeout (which is the larger value). The final check 
is common to both the no-heartbeat scenario and the no-response-to-a-message 
scenario.
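
The difference between the two behaviours can be sketched as below. This is a 
simplified model, not Geode code: the class, method and timeout values are 
hypothetical, and it assumes every member already knows the other members' 
member-timeout values.

```java
import java.util.Map;

public class FinalCheckTimeout {
    // Picks the timeout used for the final check on `suspect`.
    // useSuspectTimeout = false models the behaviour described as current
    // (the checking member's own member-timeout); true models the proposal.
    static int finalCheckTimeoutMs(Map<String, Integer> memberTimeoutMs,
                                   String checker, String suspect,
                                   boolean useSuspectTimeout) {
        String owner = useSuspectTimeout ? suspect : checker;
        return memberTimeoutMs.get(owner);
    }

    public static void main(String[] args) {
        // Hypothetical values: B is the member we want to tolerate for longer.
        Map<String, Integer> timeouts = Map.of("A", 5000, "B", 20000, "C", 2000);

        // C sent a message to B, got no response, and now does the final check.
        System.out.println("current:  " + finalCheckTimeoutMs(timeouts, "C", "B", false));
        System.out.println("proposed: " + finalCheckTimeoutMs(timeouts, "C", "B", true));
    }
}
```

In this model the final check on B waits 20000 ms under the proposal no matter 
which member performs it, which is the property the proposal is after.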

Are there any concerns about this new proposal?


Thanks,
Aravind Musigumpula 

-Original Message-
From: Bruce Schuchardt [mailto:bschucha...@pivotal.io]
Sent: Thursday, September 07, 2017 10:42 PM
To: dev@geode.apache.org
Subject: Re: Monitor the neighbour JVM using neihbour's member-timeout

I think this might be an acceptable change though I doubt many people would 
find it useful.

It's already possible to set different member-timeouts on each node of the 
distributed system but the meaning of the setting is the inverse of what's 
proposed here, so having the current setting be different in each node is 
pretty useless.

I think the initiation of suspect processing ought to be addressed if we make 
this change.  The ack-wait-threshold and ack-severe-alert-threshold aren't 
based on the member-timeout but ought to be.  This would make it possible to 
initiate suspect processing with different timing for different nodes.  It 
would still leave the question of slow backup operations hanging:  If you're 
waiting for one node that's blocked waiting for a response from another node 
(say a node holding a backup bucket) you are going to initiate suspect 
processing on the node you're waiting on & not those other (backup) nodes.

Rolling upgrade will also be a problem since old members aren't going to cough 
up their member-timeout settings.  What should be used as a membership timeout 
for the old members during an upgrade?

If we proceed with this idea I'd prefer that we deprecate member-timeout, 
ack-wait-threshold and ack-severe-alert-threshold and have new settings with 
the "ack" settings being multiples of the new membership timeout setting.
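
As a rough illustration of "ack" settings expressed as multiples of a 
membership timeout, consider the sketch below. The class, method and 
multipliers are hypothetical and not part of any Geode API. The factor 3 only 
mirrors the ratio between the documented defaults (member-timeout 5000 ms, 
ack-wait-threshold 15 s).

```java
public class DerivedAckSettings {
    // Derives an ack-style threshold (in milliseconds) as a multiple of the
    // membership timeout of the member being waited on.
    static long ackThresholdMs(long membershipTimeoutMs, int multiple) {
        return membershipTimeoutMs * multiple;
    }

    public static void main(String[] args) {
        long fastNodeMs = 5000;   // hypothetical membership timeout
        long slowNodeMs = 20000;  // hypothetical, for a node we tolerate longer

        // With a multiple of 3, suspect processing starts later for the slow node.
        System.out.println("ack wait, fast node: " + ackThresholdMs(fastNodeMs, 3) + " ms");
        System.out.println("ack wait, slow node: " + ackThresholdMs(slowNodeMs, 3) + " ms");
    }
}
```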

Concerning the PR, it isn't acceptable in its current form. 
InternalDistributedMember identifiers are often transmitted in messages and 
increasing their size affects performance.  Any new member attributes need to 
be added to NetView instead of InternalDistributedMember.



RE: DISCUSS : Monitor the neighbour JVM using neihbour's member-timeout (GEODE-3411)

2017-09-06 Thread Aravind Musigumpula
Hi Udo,

Regarding your question, "If you feel that the member timeout is too short for 
some members, why don't you increase the current member timeout?":

Yes, I feel that member-timeout is too short for some members, and I want to 
increase it for them. But a member's timeout is not used to monitor that 
member itself; the increased value would instead be used to monitor some other 
member.

So if I want one member to stay alive in the view a little longer even when it 
is not responding, I currently have to increase the timeout of all the members.

Do you mean that I should increase the current member timeout for all the 
members?


Thanks,
Aravind Musigumpula 


-Original Message-
From: Udo Kohlmeyer [mailto:ukohlme...@pivotal.io] 
Sent: Friday, September 01, 2017 10:05 PM
To: dev@geode.apache.org
Subject: Re: DISCUSS : Monitor the neighbour JVM using neihbour's 
member-timeout (GEODE-3411)

Hi there Aravind,

I have a singular problem with this approach.

If some members are designated to do more work and don't have time to respond 
to the cluster that they are alive within the current member timeout, then 
they are not available to accept data. That means they are not effective 
members of the cluster, and we cannot count on them to host data or replicates.

This setting is there to safeguard the cluster against non-responsive members 
that cause the whole cluster to be unhealthy if left unchecked for too long. 
This can lead to potential data loss.

If you feel that the member timeout is too short for some members, why don't 
you increase the current member timeout?

My opinion is a -1 for changing the current behavior.

--Udo


RE: DISCUSS : Monitor the neighbour JVM using neihbour's member-timeout (GEODE-3411)

2017-09-01 Thread Aravind Musigumpula
Hi Brian,

This helps when a user has a member that does heavier work than the others; in 
that case we need to give that member some extra time.

Thanks,
Aravind Musigumpula 


-Original Message-
From: Brian Baynes [mailto:bbay...@pivotal.io] 
Sent: Friday, September 01, 2017 4:39 AM
To: dev@geode.apache.org
Subject: Re: DISCUSS : Monitor the neighbour JVM using neihbour's 
member-timeout (GEODE-3411)

Hi, Aravind.

Can you help me understand why this might be a useful feature for Geode?  I see 
that your needs require it, but why would users in general want to allow longer 
timeouts for some members?  This is a significant change with 
backward-compatibility implications, so would be good for the community to 
understand the potential benefit.

Thanks!
Brian

This message and the information contained herein is proprietary and 
confidential and subject to the Amdocs policy statement,

you may review at https://www.amdocs.com/about/email-disclaimer 
<https://www.amdocs.com/about/email-disclaimer>


DISCUSS : Monitor the neighbour JVM using neihbour's member-timeout (GEODE-3411)

2017-08-28 Thread Aravind Musigumpula
Hi Team,

We have a requirement to configure a different member-timeout for different 
members, because we need some members to survive in the view longer than 
others before being kicked out when they aren't responding.


1.   With the current monitoring system, if we configure different 
member-timeouts for some members, it is not possible to determine when a 
member will be kicked out of the view.

2.   If a member does not respond to heartbeat requests, the member that 
monitors it initiates a check-member request.

3.   In this check-member request, the monitoring member pings the 
non-responding member and waits for the monitoring member's member-timeout 
for a response.

4.   If there is still no response, it sends a final suspect request to the 
coordinator, and the coordinator does the final check, waiting for the 
coordinator's member-timeout.

5.   If the coordinator does not get a response, it removes the 
non-responding member from the view and publishes the new view.

6.   So the time taken to remove a member depends on the timeouts of its 
monitoring member and of the coordinator. But which member does the 
monitoring depends on the view, and the view may change from time to time.

So if the monitoring member, when checking a member, waits for the 
non-responding member's member-timeout instead of its own, then the time at 
which the non-responding member is removed from the view depends only on its 
own member-timeout and the coordinator's member-timeout.
Hence we can configure a different member-timeout for the members that need it.

I created a pull request based on the above scenario: 
https://github.com/apache/geode/pull/717

Is the above approach correct? Do we have any concerns around this area?
Please give your insights on this issue.

Thanks,
Aravind Musigumpula



Monitor the neighbour JVM using neihbour's member-timeout

2017-08-22 Thread Aravind Musigumpula
Hi Team,

We have a requirement to configure a different member-timeout for different 
members, because we need some members to survive in the view longer than 
others before being kicked out when they aren't responding.


1.   With the current monitoring system, if we configure different 
member-timeouts for some members, it is not possible to determine when a 
member will be kicked out of the view.

2.   If a member does not respond to heartbeat requests, the member that 
monitors it initiates a check-member request.

3.   In this check-member request, the monitoring member pings the 
non-responding member and waits for the monitoring member's member-timeout 
for a response.

4.   If there is still no response, it sends a final suspect request to the 
coordinator, and the coordinator does the final check, waiting for the 
coordinator's member-timeout.

5.   If the coordinator does not get a response, it removes the 
non-responding member from the view and publishes the new view.

6.   So the time taken to remove a member depends on the timeouts of its 
monitoring member and of the coordinator. But which member does the 
monitoring depends on the view, and the view may change from time to time.

So if the monitoring member, when checking a member, waits for the 
non-responding member's member-timeout instead of its own, then the time at 
which the non-responding member is removed from the view depends only on its 
own member-timeout and the coordinator's member-timeout.
Hence we can configure a different member-timeout for the members that need it.
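
The timing in steps 3 and 4 can be sketched as simple arithmetic. This is a 
simplified model that counts only the two waits described above (the 
check-member wait and the coordinator's final check) and ignores the time 
taken to notice the missed heartbeats. The class, method names and values are 
hypothetical.

```java
public class RemovalTime {
    // Current behaviour: the check-member wait uses the monitoring member's timeout.
    static long currentMs(long monitorTimeoutMs, long coordinatorTimeoutMs) {
        return monitorTimeoutMs + coordinatorTimeoutMs;
    }

    // Proposed behaviour: the check-member wait uses the suspect's own timeout.
    static long proposedMs(long suspectTimeoutMs, long coordinatorTimeoutMs) {
        return suspectTimeoutMs + coordinatorTimeoutMs;
    }

    public static void main(String[] args) {
        long coordinatorMs = 5000;  // hypothetical
        long suspectMs = 20000;     // hypothetical: the member we want to tolerate longer

        // Today the result changes with whichever member happens to be the monitor.
        System.out.println("current, monitor=5000:  " + currentMs(5000, coordinatorMs));
        System.out.println("current, monitor=10000: " + currentMs(10000, coordinatorMs));
        // Under the proposal it depends only on the suspect and the coordinator.
        System.out.println("proposed:               " + proposedMs(suspectMs, coordinatorMs));
    }
}
```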

I created a pull request based on the above scenario: 
https://github.com/apache/geode/pull/717

Is the above approach correct? Do we have any concerns around this area?
Please give your insights on this issue.

Thanks,
Aravind Musigumpula



RE: Different member-timeout for particular jvm’s

2017-07-05 Thread Aravind Musigumpula
Hi,
Maybe I was not clear in my last mail. My question is: can we monitor a JVM 
based on its own member-timeout instead of the member-timeout of some other 
JVM (the one that is monitoring it)?

If my view is [s1(coordinator), s2, s3, s4, s5] and the member-timeout for each 
member is different, member s3 will suspect s4 on the basis of s3's 
member-timeout, and then the final check will be done using the coordinator's 
member-timeout. Every time the view changes, the order of monitoring also 
changes. So we cannot determine how long it will take for a particular JVM to 
be removed from the view.
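
A small sketch of why the monitor changes with the view. It assumes, as in the 
example above, that each member monitors the next member in view order; the 
wrap-around at the end of the view and all names here are assumptions, not 
Geode code.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of neighbour monitoring in an ordered view; names are hypothetical. */
class NeighbourRing {

  /** Returns the member just before {@code target} in the view, wrapping around. */
  static String monitorOf(List<String> view, String target) {
    int i = view.indexOf(target);
    if (i < 0) {
      throw new IllegalArgumentException(target + " is not in the view");
    }
    return view.get((i - 1 + view.size()) % view.size());
  }

  public static void main(String[] args) {
    List<String> view = Arrays.asList("s1", "s2", "s3", "s4", "s5");
    System.out.println(monitorOf(view, "s4")); // s3 monitors s4

    // After s3 leaves, a different member (with a different timeout) monitors s4.
    List<String> afterS3Leaves = Arrays.asList("s1", "s2", "s4", "s5");
    System.out.println(monitorOf(afterS3Leaves, "s4")); // s2 monitors s4
  }
}
```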

This can be solved if the monitoring member uses the member-timeout of the JVM 
it is monitoring.

In the above view, suppose s3 is monitoring s4. Today s3 marks s4 as a suspect 
member on the basis of s3's member-timeout. If instead s3 obtains the 
member-timeout of s4 and uses that timeout to monitor s4, then we can 
determine how long it will take for a member to be removed from the view.

Is there any way to get the member-timeout of one member from another?


Thanks,
Aravind Musigumpula

From: Aravind Musigumpula
Sent: Monday, July 03, 2017 9:35 PM
To: u...@geode.apache.org
Cc: bschucha...@pivotal.io
Subject: RE: Different member-timeout for particular jvm’s

Hi,

Can the monitoring JVM use the member-timeout of the JVM it monitors?
Example: jvm1 monitors jvm2, and jvm2 monitors jvm3. The member-timeout for 
jvm1 is 10, for jvm2 is 20, and for jvm3 is 30. Suppose jvm4 is the coordinator 
and its member-timeout is 30. What if we want jvm1 to be monitored by other 
JVMs for a deterministic time such as 10, and jvm2 for 20?

Right now, I understand that a JVM is monitored using the member-timeout of 
the monitoring JVM and then that of the coordinator. My requirement is that 
each JVM should be monitored using its own member-timeout, followed by the 
coordinator's member-timeout.

In the code, GMSHealthMonitor.java waits on the memberTimeout variable:
  if (pingResp.getResponseMsg() == null) {
    pingResp.wait(memberTimeout);
  }

What if we use the member-timeout of the JVM that is monitored by this one? We 
could do this in the setNextNeighbor function in GMSHealthMonitor.java.
But how do we get the member-timeout of another JVM? Is it possible?
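
One possible shape for this, as a hedged sketch only: the class below is 
hypothetical and not part of Geode. It assumes each member's timeout has 
somehow been made known to its peers (how that exchange would happen is left 
open) and falls back to the local timeout, which matches today's behaviour, 
when the neighbour's value is unknown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not part of Geode: picks the wait time for a check on a neighbour. */
class NeighbourTimeouts {
  private final long localMemberTimeout;
  // Timeouts advertised by other members; how they would be exchanged is left open.
  private final Map<String, Long> advertised = new ConcurrentHashMap<>();

  NeighbourTimeouts(long localMemberTimeout) {
    this.localMemberTimeout = localMemberTimeout;
  }

  /** Remembers the timeout that a member has advertised. */
  void record(String memberId, long memberTimeout) {
    advertised.put(memberId, memberTimeout);
  }

  /** The neighbour's own timeout if known, otherwise the local one. */
  long timeoutFor(String neighbourId) {
    return advertised.getOrDefault(neighbourId, localMemberTimeout);
  }

  public static void main(String[] args) {
    NeighbourTimeouts timeouts = new NeighbourTimeouts(10); // jvm1's own timeout
    timeouts.record("jvm2", 20);
    System.out.println(timeouts.timeoutFor("jvm2")); // advertised value
    System.out.println(timeouts.timeoutFor("jvm3")); // unknown, falls back to local
  }
}
```

The wait shown earlier would then use the value returned by timeoutFor for the 
current neighbour instead of the local memberTimeout.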

Thanks,
Aravind Musigumpula

From: Bruce Schuchardt [mailto:bschucha...@pivotal.io]
Sent: Tuesday, June 20, 2017 9:16 PM
To: u...@geode.apache.org
Subject: Re: Different member-timeout for particular jvm’s

It is the membership coordinator that performs the final check on a suspect 
member.  If you have network partition detection enabled or are using 
authentication of peers the role of membership coordinator will be a locator 
(if one is in the system) so in your scenario it will be the Locator that 
performs this check.  It will use its own member-timeout to determine how long 
to wait for a response to a "final check" message to the suspected member.

If the Locator is down then the oldest member in the system will take over the 
role.  This might be server1 if the membership view is [ s1, s2, s4, s3 ].  If 
there is a problem with s2 then s1 will use its own member-timeout setting to 
determine how long to wait for a final-check response from s2.
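
The rule described above can be sketched roughly as follows. This is an 
illustration, not Geode's implementation: the class and parameter names are 
hypothetical, and it assumes the view is ordered oldest member first.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of the coordinator choice described above; names are hypothetical, not Geode's. */
class CoordinatorChoice {

  /**
   * @param view members ordered oldest first (assumption)
   * @param locators ids of members that are locators
   * @param preferLocator true when network partition detection or peer authentication is enabled
   */
  static String coordinator(List<String> view, List<String> locators, boolean preferLocator) {
    if (view.isEmpty()) {
      throw new IllegalArgumentException("empty view");
    }
    if (preferLocator) {
      for (String member : view) {
        if (locators.contains(member)) {
          return member; // a locator is in the system, so it takes the role
        }
      }
    }
    return view.get(0); // otherwise the oldest member takes over
  }

  public static void main(String[] args) {
    List<String> noLocator = Arrays.asList("s1", "s2", "s4", "s3");
    System.out.println(coordinator(noLocator, Arrays.asList("locator"), true)); // s1
  }
}
```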

On 6/20/17 8:18 AM, Aravind Musigumpula wrote:
Hi,

Is there any way to configure a different member-timeout for particular JVMs?

According to my understanding, each JVM monitors its neighbor. If a JVM misses 
heartbeats from its neighbor, it waits for the member-timeout interval and 
sends a suspect message. The coordinator then tries to contact that JVM. If it 
cannot connect, the coordinator waits for its own configured member-timeout 
interval and removes the member if it is still unable to connect.

Scenario:
Locator: member-timeout=1
Server1: member-timeout=2
Server2: member-timeout=3
Server3: member-timeout=2
Server4: member-timeout=2

Suppose server1 is monitoring server2, and I make server2 hang. server1 tries 
to contact server2, waits for 2ms, and sends a suspect message. Then the 
locator tries to connect to server2; if it is unable to connect, it waits for 
1ms and removes server2 from the view.

My requirement is that server2 should not be kicked out until 4ms have passed. 
This can be done by setting member-timeout to 3 on the JVM that monitors 
server2. But how can we make sure that this particular JVM is the one 
monitoring server2? In my case, a different JVM monitors server2 every time.
Please correct me if I am wrong.


Thanks,
Aravind Musigumpula



Geode Exception: cluster configuration service not available

2017-05-24 Thread Aravind Musigumpula

Hi,

I am using cluster configuration in Geode 1.1.1. I start two locators on 
different hosts and one server for each locator. When I stop them and restart 
the cluster, I can see that the view of one of the locators contains only that 
one locator. In gfsh list members, I can see only the locator of that host, 
but not its server, nor the other locator and its server.

I tried setting the following parameters:
In locator-specific-props, I have set "enable-cluster-configuration=true".
In server-common-props, I have set "disable-auto-reconnect=false" and 
"use-cluster-configuration=true".

In the server cache log, I am getting an exception:
Cache server error
org.apache.geode.GemFireConfigException: cluster configuration service not available
    at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1067)
    at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1200)
    at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:798)
    at org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:783)
    at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:178)
    at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:171)
    at org.apache.geode.internal.cache.CacheServerLauncher.createCache(CacheServerLauncher.java:813)
    at org.apache.geode.internal.cache.CacheServerLauncher.server(CacheServerLauncher.java:657)
    at org.apache.geode.internal.cache.CacheServerLauncher.main(CacheServerLauncher.java:201)
Caused by: org.apache.geode.internal.process.ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator.
    at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:245)
    at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1029)
    ... 8 more
VM is exiting - shutting down distributed system
In one of the locator logs, I can see this:
Region /_ConfigurationRegion has potentially stale data. It is waiting for 
another member to recover the latest data.
.
.
.

 tid=0x39] View Creator is processing 2 requests 
for the next membership view

received new view: View ...  members: ..  shutdown: ...
old view is: View... members...
Peer locator received new membership view: View members: ...   shutdown: ...
 tid=0x39] no recipients for new view aside from 
myself


This seems to have been fixed in Geode 1.1.1: 
https://issues.apache.org/jira/browse/GEODE-1986
Can anybody help me with this issue?


Thanks,
Aravind
