Hi,
I recently enabled incremental repair on one of my test clusters, which
consists of 8 nodes (DC1 - 4, DC2 - 4) running C* version 2.1.13.
I am now facing a node-failure scenario in this cluster, with the
following exception during the incremental repair process:
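For reference, the repair was kicked off with nodetool, roughly as below (the exact flags are from memory, so treat them as approximate; the keyspace name is the one that appears in the error):

```shell
# Parallel incremental repair on Cassandra 2.1.
# Incremental repair is NOT the default in 2.1, so -inc must be
# passed explicitly; -par runs it in parallel across replicas.
nodetool repair -par -inc VERTICALCRM
```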

exception occurred during clean-up.  java.lang.reflect.
UndeclaredThrowableException
error: JMX connection closed. You should check server log for repair status
of keyspace VERTICALCRM(Subsequent keyspaces are not going to be repaired).
-- StackTrace --
java.io.IOException: JMX connection closed. You should check server log for
repair status of keyspace VERTICAL(Subsequent keyspaces are not going to be
repaired).
        at org.apache.cassandra.tools.RepairRunner.
handleNotification(NodeProbe.java:1496)
        at javax.management.NotificationBroadcasterSupport
.handleNotification(NotificationBroadcasterSupport.java:275)
        at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(
NotificationBroadcasterSupport.java:352)
        at javax.management.NotificationBroadcasterSupport$1.execute(
NotificationBroadcasterSupport.java:337)
        at javax.management.NotificationBroadcasterSupport.sendNotification(
NotificationBroadcasterSupport.java:248)
        at javax.management.remote.rmi.RMIConnector.sendNotification(
RMIConnector.java:441)
        at javax.management.remote.rmi.RMIConnector.access$1200(
RMIConnector.java:121)
        at javax.management.remote.rmi.RMIConnector$
RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1531)
        at javax.management.remote.rmi.RMIConnector$RMINotifClient.
fetchNotifs(RMIConnector.java:1352)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$
NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$
NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$
NotifFetcher.doRun(ClientNotifForwarder.java:471)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$
NotifFetcher.run(ClientNotifForwarder.java:452)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$
LinearExecutor$1.run(ClientNotifForwarder.java:108)

This exception brought the node down.

When I tried to restart the same node, I got the following exception:

java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<
init>(CompressedRandomAccessReader.java:73)
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.
open(CompressedRandomAccessReader.java:48)
        at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile
.createPooledReader(CompressedPoolingSegmentedFile.java:95)
        at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(
PoolingSegmentedFile.java:62)
        at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(
SSTableReader.java:1902)
        at org.apache.cassandra.db.columniterator.SimpleSliceReader.<init>(
SimpleSliceReader.java:57)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.
createReader(SSTableSliceIterator.java:65)
        at org.apache.cassandra.db.columniterator.
SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
        at org.apache.cassandra.db.filter.SliceQueryFilter.
getSSTableColumnIterator(SliceQueryFilter.java:246)
        at org.apache.cassandra.db.filter.QueryFilter.
getSSTableColumnIterator(QueryFilter.java:62)
        at org.apache.cassandra.db.CollationController.collectAllData(
CollationController.java:270)
        at org.apache.cassandra.db.CollationController.getTopLevelColumns(
CollationController.java:65)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(
ColumnFamilyStore.java:2001)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(
ColumnFamilyStore.java:1844)
        at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:353)
        at org.apache.cassandra.db.SliceFromReadCommand.getRow(
SliceFromReadCommand.java:85)
        at org.apache.cassandra.cql3.statements.SelectStatement.
readLocally(SelectStatement.java:309)
        at org.apache.cassandra.cql3.statements.SelectStatement.
executeInternal(SelectStatement.java:328)
        at org.apache.cassandra.cql3.statements.SelectStatement.
executeInternal(SelectStatement.java:67)
        at org.apache.cassandra.cql3.QueryProcessor.executeInternal(
QueryProcessor.java:317)
        at org.apache.cassandra.db.SystemKeyspace.getSSTableReadMeter(
SystemKeyspace.java:972)
        at org.apache.cassandra.io.sstable.SSTableReader$
GlobalTidy.ensureReadMeter(SSTableReader.java:2388)
        at org.apache.cassandra.io.sstable.SSTableReader$
InstanceTidier.setup(SSTableReader.java:2204)
        at org.apache.cassandra.io.sstable.SSTableReader.setup(
SSTableReader.java:2145)
        at org.apache.cassandra.io.sstable.SSTableReader.open(
SSTableReader.java:491)
        at org.apache.cassandra.io.sstable.SSTableReader.open(
SSTableReader.java:384)
        at org.apache.cassandra.io.sstable.SSTableReader$4.run(
SSTableReader.java:531)
        at java.util.concurrent.Executors$RunnableAdapter.
call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
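
For context, the node runs with the heap sizing from conf/cassandra-env.sh; the relevant knobs look like this (example values only, not my actual settings, which are the auto-calculated defaults):

```shell
# conf/cassandra-env.sh -- overriding these disables the
# auto-calculated heap; size them to the machine's RAM.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```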

While debugging the issue, I checked the SSTable count, which had
increased drastically (from 12 SSTables to 2017 SSTables in one
particular CF). Can anyone explain what went wrong here? And is there any
possible way to start the node, which is currently down?
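
The SSTable counts above came from cfstats, roughly like this (the keyspace name is taken from the error above; run against a live node):

```shell
# Print the per-CF SSTable counts for the keyspace.
nodetool cfstats VERTICALCRM | grep "SSTable count"
```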

P.S.: I have followed all the *Recommended Production Settings* for this
cluster.
