Re: nodetool status shows large numbers of up nodes are down

2015-02-10 Thread Cheng Ren
Hi Carlos,
Thanks for your suggestion. We did check the NTP settings and clocks, and
they are all working normally. Schema versions are also consistent with
peers'.
BTW, the only change we made was to lower some nodes' request
timeouts (read_request_timeout, write_request_timeout, range_request_timeout
and request_timeout) from 3 to 1 on 6 nodes yesterday. Will this
affect internode gossip?

Thanks,
Cheng
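For reference, those knobs live in cassandra.yaml with an _in_ms suffix and take millisecond values there; I'm assuming "from 3 to 1" means 3000 ms to 1000 ms, so a sketch of the change would be:

```yaml
# cassandra.yaml -- request timeouts are in milliseconds
read_request_timeout_in_ms: 1000     # was 3000
write_request_timeout_in_ms: 1000    # was 3000
range_request_timeout_in_ms: 1000    # was 3000
request_timeout_in_ms: 1000          # was 3000
```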

On Mon, Feb 9, 2015 at 11:07 PM, Carlos Rolo r...@pythian.com wrote:


Re: nodetool status shows large numbers of up nodes are down

2015-02-10 Thread Chris Lohfink
Are you hitting long GC pauses on your nodes? You can check the GC log, or
look in the Cassandra log for GCInspector entries.

Chris
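In case it helps, here is a rough sketch for scanning the Cassandra log for long pauses. The exact GCInspector line format varies by version, so the regex below is an assumption; adjust it to match your logs.

```python
import re

# GCInspector lines in 2.0.x look roughly like:
#   INFO [ScheduledTasks:1] ... GCInspector.java (line 116)
#   GC for ConcurrentMarkSweep: 2345 ms for 2 collections, ...
GC_LINE = re.compile(r"GC for (\w+): (\d+) ms")

def long_gc_pauses(lines, threshold_ms=1000):
    """Return (collector, pause_ms) for every GC pause at or above threshold_ms."""
    pauses = []
    for line in lines:
        m = GC_LINE.search(line)
        if m and int(m.group(2)) >= threshold_ms:
            pauses.append((m.group(1), int(m.group(2))))
    return pauses
```

A node that pauses longer than the failure detector tolerates will get marked down by its peers, which would match the symptoms you describe.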

On Tue, Feb 10, 2015 at 1:28 PM, Cheng Ren cheng@bloomreach.com wrote:


Re: nodetool status shows large numbers of up nodes are down

2015-02-10 Thread Carlos Rolo
Can you run nodetool tpstats and check whether there are pending tasks in
GossipStage?
The timeout should not affect gossip (AFAIK).
As for the problems this state can cause: if your nodes are marked down
for long and you are using hinted handoff, your hints may not be delivered
and your data can get out of sync (this can be fixed by increasing the
timeout limit, or during repairs).
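If you want to watch this across all nodes, you can parse the nodetool tpstats output. A rough sketch; the column layout below is an assumption based on typical 2.0.x output, so verify it against your own nodes:

```python
def pool_stats(tpstats_output, pool="GossipStage"):
    """Extract (active, pending, completed) for one pool from nodetool tpstats output."""
    for line in tpstats_output.splitlines():
        parts = line.split()
        # Typical row: <PoolName> <Active> <Pending> <Completed> [<Blocked> ...]
        if parts and parts[0] == pool:
            return int(parts[1]), int(parts[2]), int(parts[3])
    return None
```

A persistently non-zero Pending count for GossipStage suggests gossip messages are backing up on that node.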

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com

On Tue, Feb 10, 2015 at 8:51 PM, Chris Lohfink clohfin...@gmail.com wrote:


nodetool status shows large numbers of up nodes are down

2015-02-09 Thread Cheng Ren
Hi,
We have a two-DC cluster, with 21 nodes in one DC and 27 in the other. Over
the past few months, we have seen nodetool status mark 4-8 nodes as down
while they are actually functioning. Today in particular, we noticed that
running nodetool status on some nodes shows a higher number of nodes as
down than before, while those nodes are actually up and serving requests.
For example, on one node it shows 42 nodes as down.

phi_convict_threshold is set to 12 on all nodes, and we are running
Cassandra 2.0.4 on AWS EC2 machines.

Does anyone have a recommendation for identifying the root cause of this?
Will it have any consequences?

Thanks,
Cheng
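For context, the failure-detector knob mentioned above sits in cassandra.yaml; it is commented out by default (default 8, as far as I know), and values like 12 are common on EC2 to tolerate network jitter:

```yaml
# cassandra.yaml -- gossip failure detector sensitivity.
# Higher values make the detector slower to convict peers as down.
phi_convict_threshold: 12
```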


Re: nodetool status shows large numbers of up nodes are down

2015-02-09 Thread Carlos Rolo
Hi Cheng,

Are all machines configured with NTP, and are all clocks in sync? If that
is not the case, please set that up.

If your clocks are not in sync, it can cause weird issues like the ones you
are seeing, but also schema disagreements and, in some cases, corrupted data.
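One quick way to check is to run ntpq -p on each node; the offset column is in milliseconds. A small sketch that flags the worst peer offset from that output (the column positions are an assumption; check them against your ntpq version):

```python
def max_ntp_offset_ms(ntpq_output):
    """Return the largest absolute peer offset (ms) from `ntpq -p` output."""
    offsets = []
    for line in ntpq_output.splitlines()[2:]:  # skip the two header lines
        parts = line.split()
        if len(parts) >= 9:
            # Columns: remote refid st t when poll reach delay offset jitter
            offsets.append(abs(float(parts[8])))
    return max(offsets) if offsets else None
```

Offsets of more than a few hundred ms between nodes are worth fixing regardless of whether they explain the gossip flapping.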

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com

On Tue, Feb 10, 2015 at 3:40 AM, Cheng Ren cheng@bloomreach.com wrote:

