[jira] [Commented] (CASSANDRA-10992) Hanging streaming sessions

Paulo Motta (JIRA) Thu, 23 Jun 2016 13:59:38 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347174#comment-15347174
 ]


Paulo Motta commented on CASSANDRA-10992:
-----------------------------------------

I created a 
[dtest|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:10992#diff-0b30b9f097df89d74be1d1af8205ac7eR147]
 to reproduce this with the following sequence of steps:
- create a 2-node cluster with data in a compressed table
- bootstrap a new node
- fail one of the original nodes during bootstrap
- expect bootstrap to fail

Currently bootstrap neither completes nor fails. When 
{{CompressedStreamReader}} gets an {{IOException}} due to the node failure, it 
tries to drain the {{CompressedInputStream}} for a possible retry. Because the 
socket is closed, the {{Reader}} thread never fills the data buffer, so the 
stream session blocks on the drain and never completes.
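
The hang can be modelled outside Cassandra with a consumer waiting on a buffer that its producer thread never fills. The sketch below is a simplified illustration only: {{DrainHangSketch}}, {{drainOneChunk}} and the queue-based buffer are hypothetical names and assumptions, not the actual streaming classes. It uses a timed {{poll}} so the example terminates; an untimed {{take()}} in the same position would block indefinitely.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the hang; not Cassandra's actual classes.
class DrainHangSketch {
    /**
     * Starts a "reader" thread that exits without enqueueing anything
     * (modelling a closed socket), then tries to drain one chunk.
     * Returns true if a chunk arrived within the timeout.
     */
    static boolean drainOneChunk(long timeoutMillis) {
        BlockingQueue<byte[]> dataBuffer = new ArrayBlockingQueue<>(4);

        Thread reader = new Thread(() -> {
            // The socket is closed, so the reader gives up without putting
            // any data (or any end-of-stream signal) on the buffer.
        }, "reader");
        reader.start();

        try {
            reader.join();
            // An untimed dataBuffer.take() here would wait forever, because
            // nothing will ever be enqueued. The timed poll exists only so
            // that this sketch terminates.
            byte[] chunk = dataBuffer.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return chunk != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling {{DrainHangSketch.drainOneChunk(200)}} returns {{false}} after the timeout, which corresponds to the drain that never finishes in the untimed case.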

Since we never retry on {{IOException}}, the fix is to avoid draining the 
socket on {{IOException}}. Furthermore, I updated {{CompressedInputStream}} to 
cache the exception and re-throw it if someone tries to read from the failed 
input stream again.
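
As a rough illustration of the "cache and re-throw" idea (not the patch itself, which is in the branches linked in this comment), the wrapper below remembers the first {{IOException}} and fails fast on every later read instead of touching the source again. {{FailFastInputStream}} and {{FailFastDemo}} are hypothetical names used only for this sketch.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper illustrating "cache the exception and re-throw it";
// this is not the actual CompressedInputStream change.
class FailFastInputStream extends InputStream {
    private final InputStream in;
    private IOException failure; // first error seen, if any

    FailFastInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        if (failure != null)
            throw failure; // already failed: do not read the source again
        try {
            return in.read();
        } catch (IOException e) {
            failure = e; // remember the error for later callers
            throw e;
        }
    }
}

class FailFastDemo {
    /** Returns how often the underlying source was read across two failed attempts. */
    static int sourceReadsAfterTwoAttempts() {
        int[] calls = {0};
        InputStream broken = new InputStream() {
            @Override
            public int read() throws IOException {
                calls[0]++;
                throw new IOException("connection reset");
            }
        };
        FailFastInputStream stream = new FailFastInputStream(broken);
        for (int i = 0; i < 2; i++) {
            try {
                stream.read();
            } catch (IOException expected) {
                // Both attempts fail; the second one gets the cached exception.
            }
        }
        return calls[0];
    }
}
```

In this sketch the second read fails immediately with the cached exception, so the broken source is only read once.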

The patch and CI runs are available below:
||2.1||2.2||3.0||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10992]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-10992]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-10992]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10992]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:10992]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10992-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10992-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10992-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10992-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10992-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10992-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10992-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10992-dtest/lastCompletedBuild/testReport/]|

On a related note, it's not very clear to me which conditions retry is trying 
to recover from, and draining based on consumed size is a bit fragile, since we 
may block on the socket/CIS if there is a size mismatch due to a bug (a la 
CASSANDRA-10005). I think it's safer to drain based on a magic number, so we 
should probably consider that in the future. That said, I wonder if we should 
disable retry altogether, because I think that in its current state it brings 
more pain than benefit.
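
To make the magic-number idea concrete, here is a minimal sketch of draining until a marker is seen rather than until a computed byte count is reached. Everything in it is an assumption for illustration: the {{MARKER}} value, the class and the method names are hypothetical, and the current streaming protocol does not define such a marker.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch of marker-based draining; not part of Cassandra.
class MarkerDrainSketch {
    // Made-up end-of-file marker; the value is an assumption.
    static final byte[] MARKER = {(byte) 0xCA, (byte) 0x55, (byte) 0xAD, (byte) 0x0A};

    /**
     * Discards bytes up to and including MARKER.
     * Returns the number of bytes discarded, or -1 if the stream ended first.
     */
    static long drainToMarker(InputStream in) throws IOException {
        byte[] window = new byte[MARKER.length]; // the last MARKER.length bytes read
        long consumed = 0;
        int b;
        while ((b = in.read()) != -1) {
            // Slide the window left by one and append the new byte.
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = (byte) b;
            consumed++;
            if (consumed >= MARKER.length && Arrays.equals(window, MARKER))
                return consumed;
        }
        return -1;
    }

    /** Convenience wrapper for in-memory data. */
    static long drain(byte[] data) {
        try {
            return drainToMarker(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this approach the drain does not depend on the expected size being right, though a real protocol would also have to make sure the marker cannot appear inside the payload.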

> Hanging streaming sessions
> --------------------------
>
>                 Key: CASSANDRA-10992
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10992
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: C* 2.1.12, Debian Wheezy
>            Reporter: mlowicki
>            Assignee: Paulo Motta
>             Fix For: 2.1.12
>
>         Attachments: apache-cassandra-2.1.12-SNAPSHOT.jar, db1.ams.jstack, 
> db6.analytics.jstack
>
>
> I've recently started running repair using [Cassandra 
> Reaper|https://github.com/spotify/cassandra-reaper] (the built-in {{nodetool 
> repair}} doesn't work for me - CASSANDRA-9935). It behaves fine, but I've 
> noticed hanging streaming sessions:
> {code}
> root@db1:~# date
> Sat Jan  9 16:43:00 UTC 2016
> root@db1:~# nt netstats -H | grep total
>         Receiving 5 files, 46.59 MB total. Already received 1 files, 11.32 MB total
>         Sending 7 files, 46.28 MB total. Already sent 7 files, 46.28 MB total
>         Receiving 6 files, 64.15 MB total. Already received 1 files, 12.14 MB total
>         Sending 5 files, 61.15 MB total. Already sent 5 files, 61.15 MB total
>         Receiving 4 files, 7.75 MB total. Already received 3 files, 7.58 MB total
>         Sending 4 files, 4.29 MB total. Already sent 4 files, 4.29 MB total
>         Receiving 12 files, 13.79 MB total. Already received 11 files, 7.66 MB total
>         Sending 5 files, 15.32 MB total. Already sent 5 files, 15.32 MB total
>         Receiving 8 files, 20.35 MB total. Already received 1 files, 13.63 MB total
>         Sending 38 files, 125.34 MB total. Already sent 38 files, 125.34 MB total
> root@db1:~# date
> Sat Jan  9 17:45:42 UTC 2016
> root@db1:~# nt netstats -H | grep total
>         Receiving 5 files, 46.59 MB total. Already received 1 files, 11.32 MB total
>         Sending 7 files, 46.28 MB total. Already sent 7 files, 46.28 MB total
>         Receiving 6 files, 64.15 MB total. Already received 1 files, 12.14 MB total
>         Sending 5 files, 61.15 MB total. Already sent 5 files, 61.15 MB total
>         Receiving 4 files, 7.75 MB total. Already received 3 files, 7.58 MB total
>         Sending 4 files, 4.29 MB total. Already sent 4 files, 4.29 MB total
>         Receiving 12 files, 13.79 MB total. Already received 11 files, 7.66 MB total
>         Sending 5 files, 15.32 MB total. Already sent 5 files, 15.32 MB total
>         Receiving 8 files, 20.35 MB total. Already received 1 files, 13.63 MB total
>         Sending 38 files, 125.34 MB total. Already sent 38 files, 125.34 MB total
> {code}
> Such sessions remain even long after the repair job has finished (confirmed 
> by checking Reaper's and Cassandra's logs). {{streaming_socket_timeout_in_ms}} 
> in cassandra.yaml is set to the default value (3600000).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
