Re: [EXTERNAL] Re: Bootstrap keeps failing

2019-02-07 Thread Léo FERLIN SUTTON
Thank you for the recommendation.

We are already using DataStax's recommended settings for tcp_keepalive.

Regards,

Leo


RE: [EXTERNAL] Re: Bootstrap keeps failing

2019-02-07 Thread Durity, Sean R
I have seen unreliable streaming (streaming that doesn’t finish) because of TCP 
timeouts from firewalls or switches. The default tcp_keepalive kernel 
parameters are usually not tuned for that. See 
https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/idleFirewallLinux.html
for more details. These “remote” timeouts are difficult to detect or prove if 
you don’t have access to the intermediate network equipment.
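
To make the keepalive point concrete, here is a minimal Python sketch that compares the kernel keepalive settings against a firewall idle timeout. The 3600-second idle timeout is purely an assumption for illustration (the real value depends on the network equipment in the path), and the helper names are hypothetical. Keepalive settings only matter for sockets that have SO_KEEPALIVE enabled.

```python
from pathlib import Path

# Standard Linux location of the tcp_keepalive_* sysctls.
PROC_DIR = Path("/proc/sys/net/ipv4")


def read_keepalive(proc_dir=PROC_DIR):
    """Read the three kernel keepalive settings (seconds, seconds, probe count)."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return {name: int((proc_dir / name).read_text().strip()) for name in names}


def first_probe_too_late(settings, idle_timeout_s):
    """True if an idle connection could be dropped before the first probe is sent."""
    return settings["tcp_keepalive_time"] >= idle_timeout_s


# Stock Linux defaults: first probe after 2 hours, then 9 probes 75 seconds apart.
LINUX_DEFAULTS = {
    "tcp_keepalive_time": 7200,
    "tcp_keepalive_intvl": 75,
    "tcp_keepalive_probes": 9,
}

# Hypothetical firewall idle timeout, NOT a value taken from this thread.
ASSUMED_IDLE_TIMEOUT_S = 3600
```

On a Linux host, `first_probe_too_late(read_keepalive(), ASSUMED_IDLE_TIMEOUT_S)` checks the live settings; with the stock defaults above the check returns True, meaning the firewall could silently drop an idle streaming connection before the kernel ever probes it.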

Sean Durity

From: Léo FERLIN SUTTON 
Sent: Thursday, February 07, 2019 10:26 AM
To: user@cassandra.apache.org; dinesh.jo...@yahoo.com
Subject: [EXTERNAL] Re: Bootstrap keeps failing

Hello !

Thank you for your answers.

So I have tried, multiple times, to start bootstrapping from scratch. I often 
have the same problem (on other nodes as well) but sometimes it works and I can 
move on to another node.

I have attached a jstack dump and some logs.

Our node was shut down at around 97% disk space used.
I turned it back on and it started the bootstrap process again.

The log file is the log from this attempt, same for the thread dump.

Small warning, I have somewhat anonymised the log files so there may be some 
inconsistencies.

Regards,

Leo

On Thu, Feb 7, 2019 at 8:13 AM dinesh.jo...@yahoo.com.INVALID
<dinesh.jo...@yahoo.com.invalid> wrote:
Would it be possible for you to take a thread dump & logs and share them?

Dinesh


On Wednesday, February 6, 2019, 10:09:11 AM PST, Léo FERLIN SUTTON
<lfer...@mailjet.com.INVALID> wrote:


Hello !

I am having a recurrent problem when trying to bootstrap a few new nodes.

Some general info:

  *   I am running Cassandra 3.0.17
  *   We have about 30 nodes in our cluster
  *   All healthy nodes have between 60% and 90% used disk space on
      /var/lib/cassandra

So I create a new node and let auto_bootstrap do its job. After a few days the 
bootstrapping node stops streaming new data but is still not a member of the 
cluster.

`nodetool status` says the node is still joining.

When this happens I run `nodetool bootstrap resume`. This usually ends in one 
of two ways:

  1.  The node fills up to 100% disk space and crashes.
  2.  The bootstrap resume finishes with errors.

When I look at `nodetool netstats -H` it looks like `bootstrap resume` does 
not resume but restarts a full transfer of all the data from every node.

This is the output I get from `nodetool bootstrap resume`:
[2019-02-06 01:39:14,369] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-225-big-Data.db (progress: 2113%)
[2019-02-06 01:39:16,821] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-88-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,003] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-89-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,032] session with /10.16.XX.YYY complete (progress: 2113%)
[2019-02-06 01:41:15,160] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-220-big-Data.db (progress: 2113%)
[2019-02-06 01:42:02,864] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-226-big-Data.db (progress: 2113%)
[2019-02-06 01:42:09,284] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-227-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,522] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-228-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,622] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-229-big-Data.db (progress: 2113%)
[2019-02-06 01:42:11,925] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-90-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,887] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-91-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,980] session with /10.16.XX.ZZZ complete (progress: 2114%)
[2019-02-06 01:42:14,980] Stream failed
[2019-02-06 01:42:14,982] Error during bootstrap: Stream failed
[2019-02-06 01:42:14,982] Resume bootstrap complete

The bootstrap `progress` goes way over 100% and eventually fails.
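
For anyone who wants to spot this pattern in saved output, here is a rough Python sketch that scans lines in the format shown above for the highest reported progress and for a stream failure. The function name and the sample list are hypothetical; the only assumption about the format is the `(progress: N%)` suffix and the `Stream failed` text visible in the output above.

```python
import re

# Matches the "(progress: N%)" suffix seen in the resume output.
PROGRESS_RE = re.compile(r"\(progress:\s*(\d+)%\)")


def summarize(lines):
    """Return (highest progress percentage seen, whether a stream failure was reported)."""
    max_progress = 0
    failed = False
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            max_progress = max(max_progress, int(match.group(1)))
        if "Stream failed" in line:
            failed = True
    return max_progress, failed


# Three lines copied from the output above, used only as sample input.
SAMPLE = [
    "[2019-02-06 01:42:14,980] session with /10.16.XX.ZZZ complete (progress: 2114%)",
    "[2019-02-06 01:42:14,980] Stream failed",
    "[2019-02-06 01:42:14,982] Error during bootstrap: Stream failed",
]

max_progress, failed = summarize(SAMPLE)
```

A progress value above 100 combined with a reported failure matches the symptom described here: the resume appears to re-stream data rather than pick up where it stopped.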


Right now I have a node with this output from `nodetool status`:
`UJ  10.16.XX.YYY  2.93 TB  256  ?  5788f061-a3c0-46af-b712-ebeecd397bf7  c`

It is almost filled with data, yet if I look at `nodetool netstats`:
Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
Receiving 424 file