[ 
https://issues.apache.org/jira/browse/CASSANDRA-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094688#comment-17094688
 ] 

David Capwell commented on CASSANDRA-14831:
-------------------------------------------

Based off the logs it looks like the repair coordinator was notified of the 
error but didn't update the flag to say that the repair was failed (which 
nodetool pools on); I believe this should be fixed in CASSANDRA-15564.

The next issue was "Cannot start multiple repair sessions over the same 
sstables" 
([source|https://github.com/apache/cassandra/blob/cassandra-3.11.1/src/java/org/apache/cassandra/service/ActiveRepairService.java#L559]);
 this implies that the metadata on the node still thinks the node is repairing. 
 I feel that this falls into the same class of bugs that CASSANDRA-15566 calls 
out. Most of the focus of CASSANDRA-15566 was on full repair (IR is based off 
this, so benefits from it), so there may be even more issues with IR.

> Nodetool repair hangs with java.net.SocketException: End-of-stream reached
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14831
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Tania S Engel
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 3.11.x
>
>         Attachments: Cassandra - 14831 Logs.mht
>
>
> Using Cassandra 3.11.1.
> Ran >nodetool repair <keyspacename> on a small 3 node cluster  from node 
> 3eef. Node 9160 and 3f5e experienced a stream failure. 
> *NODE 9160:* 
> ERROR [STREAM-IN-/fd70:616e:6761:6561:ae1f:6bff:fe12:3f5e:7000] 2018-10-16 
> 01:45:00,400 StreamSession.java:593 - [Stream 
> #103fe070-d0e5-11e8-a993-5929a1c131b4] Streaming error occurred on session 
> with peer fd70:616e:6761:6561:ae1f:6bff:fe12:3f5e
> *java.net.SocketException: End-of-stream reached*
> at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:71)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:311)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_152]
>  
> *NODE 3f5e:*
> ERROR [STREAM-IN-/fd70:616e:6761:6561:ec4:7aff:fece:9160:59676] 2018-10-16 
> 01:45:09,474 StreamSession.java:593 - [Stream 
> #103ef610-d0e5-11e8-a993-5929a1c131b4] Streaming error occurred on session 
> with peer fd70:616e:6761:6561:ec4:7aff:fece:9160
> java.io.IOException: An existing connection was forcibly closed by the remote 
> host
> at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[na:1.8.0_152]
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43) ~[na:1.8.0_152]
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_152]
> at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.8.0_152]
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) 
> ~[na:1.8.0_152]
> at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:206) 
> ~[na:1.8.0_152]
> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) 
> ~[na:1.8.0_152]
> at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) 
> ~[na:1.8.0_152]
> at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:56)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:311)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_152]
>  
> *NODE 3EEF:*
> ERROR [RepairJobTask:14] 2018-10-16 01:45:00,457 RepairSession.java:281 - 
> [repair #f2ab3eb0-d0e4-11e8-9926-bf64f35712c1] Session completed with the 
> following error
> org.apache.cassandra.exceptions.RepairException: [repair 
> #f2ab3eb0-d0e4-11e8-9926-bf64f35712c1 on logs/{color:#333333}XXXXXX{color}, 
> [(-8271925838625565988,-8266397600493941101], 
> (2290821710735817606,2299380749828706426] 
> …(-8701313305140908434,-8686533141993948378]]] Sync failed between 
> /fd70:616e:6761:6561:ec4:7aff:fece:9160 and 
> /fd70:616e:6761:6561:ae1f:6bff:fe12:3f5e
> at 
> org.apache.cassandra.repair.RemoteSyncTask.syncComplete(RemoteSyncTask.java:67)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:202)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:495)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:162)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) 
> ~[apache-cassandra-3.11.1.jar:3.11.1]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_152]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_152]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_152]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_152]
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
>  [apache-cassandra-3.11.1.jar:3.11.1]
> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_152]
>  
> ERROR [RepairJobTask:14] 2018-10-16 01:45:00,459 RepairRunnable.java:276 - 
> Repair session f2ab3eb0-d0e4-11e8-9926-bf64f35712c1 for range 
> [(-8271925838625565988,-8266397600493941101],…(-6146831664074703724,-6117107236121156255],
>  (4842256698807887573,4848113042863615717], 
> (-8701313305140908434,-8686533141993948378]] failed with error [repair 
> #f2ab3eb0-d0e4-11e8-9926-bf64f35712c1 on 
> logs/auditsearchlog,…(-8701313305140908434,-8686533141993948378]]] Sync 
> failed between /fd70:616e:6761:6561:ec4:7aff:fece:9160 and 
> /fd70:616e:6761:6561:ae1f:6bff:fe12:3f5e
> org.apache.cassandra.exceptions.RepairException: [repair 
> #f2ab3eb0-d0e4-11e8-9926-bf64f35712c1 on logs/auditsearchlog, 
> [(-8271925838625565988,-8266397600493941101], 
> …(4842256698807887573,4848113042863615717], 
> (-8701313305140908434,-8686533141993948378]]] Sync failed between 
> /fd70:616e:6761:6561:ec4:7aff:fece:9160 and 
> /fd70:616e:6761:6561:ae1f:6bff:fe12:3f5e
> at 
> org.apache.cassandra.repair.RemoteSyncTask.syncComplete(RemoteSyncTask.java:67)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:202)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:495)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:162)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
> at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) 
> ~[apache-cassandra-3.11.1.jar:3.11.1]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_152]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_152]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_152]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_152]
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
>  [apache-cassandra-3.11.1.jar:3.11.1]
> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_152]
>  
> *NODETOOL OUTPUT: shows the failure but then never returns.* 
>  
> [2018-10-16 01:43:57,310] Starting repair command #8 
> (f26bc4b0-d0e4-11e8-9926-bf64f35712c1), repairing keyspace logs with repair 
> options (parallelism: parallel, primary range: false, incremental: true, job 
> threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, 
> pull repair: false)
> [2018-10-16 01:45:00,462] Repair session f2ab3eb0-d0e4-11e8-9926-bf64f35712c1 
> for range …
> (4842256698807887573,4848113042863615717], 
> (-8701313305140908434,-8686533141993948378]] failed with error [repair 
> #f2ab3eb0-d0e4-11e8-9926-bf64f35712c1 on logs/XXXXXX,
> [(-8271925838625565988,-8266397600493941101], 
> (2290821710735817606,2299380749828706426], …
> (4842256698807887573,4848113042863615717], 
> (-8701313305140908434,-8686533141993948378]]] Sync failed between 
> /fd70:616e:6761:6561:ec4:7aff:fece:9160 and 
> /fd70:616e:6761:6561:ae1f:6bff:fe12:3f5e (progress: 0%)
>  
> The streaming does continue between the 3 nodes. See the attached partial 
> logs from all 3 nodes. Then it stops. We never see that the repair command 
> finished. Then about 15 hours later, we run  >nodetool repair logs again. It 
> fails.  This time the error indicates there is an active repair session.  The 
> only thing that seemed to get us out of this state was reboot of all the 
> nodes.
>  
> *ERROR [ValidationExecutor:27] 2018-10-16 17:14:39,241 
> ActiveRepairService.java:558 - Cannot start multiple repair sessions over the 
> same sstables*
> ERROR [ValidationExecutor:27] 2018-10-16 17:14:39,241 Validator.java:268 - 
> Failed creating a merkle tree for [repair 
> #da436780-d166-11e8-9926-bf64f35712c1 on logs/YYYYY, 
> [(-8271925838625565988,-8266397600493941101], ...
> /fd70:616e:6761:6561:ae1f:6bff:fe12:3ee4 (see log for details)
> ERROR [ValidationExecutor:27] 2018-10-16 17:14:39,244 
> CassandraDaemon.java:228 - Exception in thread 
> Thread[ValidationExecutor:27,1,main]
> java.lang.RuntimeException: Cannot start multiple repair sessions over the 
> same sstables
>  at 
> org.apache.cassandra.service.ActiveRepairService$ParentRepairSession.markSSTablesRepairing(ActiveRepairService.java:559)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
>  at 
> org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1446)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
>  at 
> org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1348)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
>  at 
> org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:86)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
>  at 
> org.apache.cassandra.db.compaction.CompactionManager$13.call(CompactionManager.java:942)
>  ~[apache-cassandra-3.11.1.jar:3.11.1]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_152]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[na:1.8.0_152]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_152]
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
>  [apache-cassandra-3.11.1.jar:3.11.1]
>  at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_152]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to