Hi,

Recently I enabled incremental repair on one of my test clusters, which consists of 8 nodes (DC1 - 4, DC2 - 4) running C* version 2.1.13. I am now facing a node-failure scenario in this cluster that began during the incremental repair process.
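For context, this is roughly how the repair is being kicked off on each node (a sketch of what I mean by "incremental repair", not my exact script; running it per keyspace is an assumption here, and -par/-inc are the 2.1 nodetool flags for parallel incremental repair):

    # Assumption: repair is run per keyspace, node by node.
    # -par = parallel, -inc = incremental (nodetool repair flags in 2.1).
    nodetool repair -par -inc VERTICALCRM

During one of these runs, the repair failed on one node with the exception below.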
exception occurred during clean-up. java.lang.reflect.UndeclaredThrowableException
error: JMX connection closed. You should check server log for repair status of keyspace VERTICALCRM(Subsequent keyspaces are not going to be repaired).
-- StackTrace --
java.io.IOException: JMX connection closed. You should check server log for repair status of keyspace VERTICAL(Subsequent keyspaces are not going to be repaired).
        at org.apache.cassandra.tools.RepairRunner.handleNotification(NodeProbe.java:1496)
        at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
        at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
        at javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
        at javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
        at javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
        at javax.management.remote.rmi.RMIConnector.access$1200(RMIConnector.java:121)
        at javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1531)
        at javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

This exception took the node down. When I tried to start the same node again, I got the following exception:

java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:73)
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:48)
        at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.createPooledReader(CompressedPoolingSegmentedFile.java:95)
        at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(PoolingSegmentedFile.java:62)
        at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:1902)
        at org.apache.cassandra.db.columniterator.SimpleSliceReader.<init>(SimpleSliceReader.java:57)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
        at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:246)
        at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62)
        at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:270)
        at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:2001)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1844)
        at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:353)
        at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:85)
        at org.apache.cassandra.cql3.statements.SelectStatement.readLocally(SelectStatement.java:309)
        at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:328)
        at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:67)
        at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:317)
        at org.apache.cassandra.db.SystemKeyspace.getSSTableReadMeter(SystemKeyspace.java:972)
        at org.apache.cassandra.io.sstable.SSTableReader$GlobalTidy.ensureReadMeter(SSTableReader.java:2388)
        at org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier.setup(SSTableReader.java:2204)
        at org.apache.cassandra.io.sstable.SSTableReader.setup(SSTableReader.java:2145)
        at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:491)
        at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:384)
        at org.apache.cassandra.io.sstable.SSTableReader$4.run(SSTableReader.java:531)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

While debugging the issue, I checked the SSTable count, which had increased drastically (from 12 SSTables to 2017 SSTables in one particular CF). Can anyone explain what went wrong here? And is there any possible way to start the node that is currently down?

P.S : I have followed all the *Recommended Production settings* for this cluster.
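P.P.S : For reference, this is how I am reading the per-CF SSTable count (a sketch; the keyspace/table pair is an example standing in for the affected CF):

    # "SSTable count" is reported per table by cfstats in 2.1;
    # <cf_name> is a placeholder for the affected column family.
    nodetool cfstats VERTICALCRM.<cf_name> | grep "SSTable count"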