[ https://issues.apache.org/jira/browse/CASSANDRA-11906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410907#comment-15410907 ]

Stefano Ortolani commented on CASSANDRA-11906:
----------------------------------------------

Unfortunately I might need to reopen this issue, since one node died last night 
while doing a sequential repair.

{noformat}
INFO  [STREAM-IN-/10.12.81.6] 2016-08-07 00:07:11,145 
StreamResultFuture.java:169 - [Stream #e276de60-5c32-11e6-b182-e548aa39c6eb 
ID#0] Prepare completed. Receiving 2 files(1974371 bytes), sending 2 
files(2101094 bytes)
INFO  [AntiEntropyStage:1] 2016-08-07 00:07:11,895 RepairSession.java:181 - 
[repair #739ae1c1-5c2e-11e6-b182-e548aa39c6eb] Received merkle tree for 
llkgbsubject2sample from /10.12.33.6
INFO  [AntiEntropyStage:1] 2016-08-07 00:07:11,895 RepairJob.java:213 - 
Validating /10.12.81.6
WARN  [STREAM-IN-/10.12.81.6] 2016-08-07 00:07:11,970 CLibrary.java:280 - 
open(/data/cassandra/data/schema/data-cf-685a935014a311e69af3cdff32bdcb0d, 
O_RDONLY) failed, errno (24).
WARN  [STREAM-IN-/10.12.33.6] 2016-08-07 00:07:11,970 CLibrary.java:280 - 
open(/data/cassandra/data/schema/data-cf-685a935014a311e69af3cdff32bdcb0d, 
O_RDONLY) failed, errno (24).
WARN  [STREAM-IN-/10.12.81.6] 2016-08-07 00:07:11,970 
CompressedStreamReader.java:118 - [Stream e276de60-5c32-11e6-b182-e548aa39c6eb] 
Error while reading partition null from stream on ks='schema' and 
table='data-cf'.
ERROR [STREAM-IN-/10.12.81.6] 2016-08-07 00:07:11,984 StreamSession.java:528 - 
[Stream #e276de60-5c32-11e6-b182-e548aa39c6eb] Streaming error occurred
java.nio.file.FileSystemException: 
/data/cassandra/data/schema/data-cf-685a935014a311e69af3cdff32bdcb0d/mb-4984-big-Index.db:
 Too many open files
        at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) 
~[na:1.8.0_91]
        at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) 
~[na:1.8.0_91]
        at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) 
~[na:1.8.0_91]
        at 
sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
 ~[na:1.8.0_91]
        at java.nio.channels.FileChannel.open(FileChannel.java:287) 
~[na:1.8.0_91]
        at java.nio.channels.FileChannel.open(FileChannel.java:335) 
~[na:1.8.0_91]
        at 
org.apache.cassandra.io.util.SequentialWriter.openChannel(SequentialWriter.java:114)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.io.util.SequentialWriter.<init>(SequentialWriter.java:135) 
~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.io.util.SequentialWriter.open(SequentialWriter.java:150) 
~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.io.sstable.format.big.BigTableWriter$IndexWriter.<init>(BigTableWriter.java:375)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.io.sstable.format.big.BigTableWriter.<init>(BigTableWriter.java:84)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.io.sstable.format.big.BigFormat$WriterFactory.open(BigFormat.java:93)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.io.sstable.format.SSTableWriter.create(SSTableWriter.java:96)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.create(SimpleSSTableMultiWriter.java:114)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.db.compaction.AbstractCompactionStrategy.createSSTableMultiWriter(AbstractCompactionStrategy.java:516)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.db.compaction.CompactionStrategyManager.createSSTableMultiWriter(CompactionStrategyManager.java:498)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.db.ColumnFamilyStore.createSSTableMultiWriter(ColumnFamilyStore.java:481)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.db.ColumnFamilyStore.createSSTableMultiWriter(ColumnFamilyStore.java:476)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.streaming.StreamReader.createWriter(StreamReader.java:162) 
~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.streaming.compress.CompressedStreamReader.read(CompressedStreamReader.java:92)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:50)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:39)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:59)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at 
org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:287)
 ~[apache-cassandra-3.0.8.jar:3.0.8]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
ERROR [STREAM-IN-/10.12.33.6] 2016-08-07 00:07:12,043 
JVMStabilityInspector.java:140 - JVM state determined to be unstable.  Exiting 
forcefully due to:
java.io.FileNotFoundException: 
/data/cassandra/data/schema/data-cf-685a935014a311e69af3cdff32bdcb0d/
{noformat}
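
For reference, errno 24 on Linux is EMFILE ("Too many open files"), i.e. the 
process exhausted its open file descriptor limit. To see how fd usage grows while 
the repair runs, a loop like the one below can sample the count from /proc (the 
pgrep pattern is just an example; adjust it to however the daemon is started):

{noformat}
# Sample the open-fd count of the Cassandra JVM every 10 seconds.
# /proc/<pid>/fd holds one entry per descriptor that actually counts
# against the nofile limit.
while sleep 10; do
    echo "$(date '+%T') $(ls /proc/$(pgrep -f CassandraDaemon)/fd | wc -l)"
done
{noformat}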

> Unstable JVM due to too many open files when anticompacting big LCS tables
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11906
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11906
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Stefano Ortolani
>
> I have recently moved from C* 2.1.x to C* 3.0.6. The setup is quite 
> heavy:
>       - 13 nodes with spinning disks
>       - ~120 GB of data per node
>       - 50% of the CFs use LCS and have quite wide rows
>       - 2-3 of the LCS CFs have more than 200 SSTables
> Incremental repairs do not seem to play really well with that.
> I have been running some tests node by node using the -pr option:
> {code}
> nodetool -h localhost repair -pr keyscheme
> {code}
> and to my surprise the whole process takes quite some time (1 hour
> minimum, 8 hours if I haven't been repairing for 5-6 days).
> Yesterday I tried to run the command with the -seq option so as to
> decrease the number of simultaneous compactions. After a while,
> the node on which I was running the repair simply died during
> the anticompaction phase with the following
> exception in the logs.
> {code}
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-25 21:54:21,868 
> ScheduledReporter.java:119 - RuntimeException thrown from 
> GraphiteReporter#report. Exception was suppressed.
> java.lang.RuntimeException: Failed to list files in 
> /data/cassandra/data/keyschema/columnfamily-3996ce80b7ac11e48a9b6776bf484396
>       at 
> org.apache.cassandra.db.lifecycle.LogAwareFileLister.list(LogAwareFileLister.java:57)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.getFiles(LifecycleTransaction.java:547)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:691)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.Directories$SSTableLister.listFiles(Directories.java:662)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.Directories$TrueFilesSizeVisitor.<init>(Directories.java:981)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.Directories.getTrueAllocatedSizeIn(Directories.java:893)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.Directories.trueSnapshotsSize(Directories.java:883) 
> ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.ColumnFamilyStore.trueSnapshotsSize(ColumnFamilyStore.java:2332)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.metrics.TableMetrics$32.getValue(TableMetrics.java:637) 
> ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.metrics.TableMetrics$32.getValue(TableMetrics.java:634) 
> ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:281)
>  ~[metrics-graphite-3.1.0.jar:3.1.0]
>       at 
> com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:158)
>  ~[metrics-graphite-3.1.0.jar:3.1.0]
>       at 
> com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) 
> ~[metrics-core-3.1.0.jar:3.1.0]
>       at 
> com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) 
> ~[metrics-core-3.1.0.jar:3.1.0]
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [na:1.8.0_91]
>       at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
> [na:1.8.0_91]
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  [na:1.8.0_91]
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  [na:1.8.0_91]
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_91]
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_91]
>       at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> Caused by: java.lang.RuntimeException: java.nio.file.FileSystemException: 
> /data/cassandra/data/keyschema/columnfamily-3996ce80b7ac11e48a9b6776bf484396/ma_txn_anticompactionafterrepair_f20b50d0-22bd-11e6-970f-6f22464f4624.log:
>  Too many open files
>       at org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:622) 
> ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321) 
> ~[na:1.8.0_91]
>       at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) 
> ~[na:1.8.0_91]
>       at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_91]
>       at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>  ~[na:1.8.0_91]
>       at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) 
> ~[na:1.8.0_91]
>       at 
> org.apache.cassandra.db.lifecycle.LogReplicaSet.readRecords(LogReplicaSet.java:105)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.lifecycle.LogFile.verify(LogFile.java:132) 
> ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.lifecycle.LogAwareFileLister.readTxnLog(LogAwareFileLister.java:113)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.lifecycle.LogAwareFileLister.classifyFiles(LogAwareFileLister.java:106)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) 
> ~[na:1.8.0_91]
>       at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
>  ~[na:1.8.0_91]
>       at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
>  ~[na:1.8.0_91]
>       at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 
> ~[na:1.8.0_91]
>       at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) 
> ~[na:1.8.0_91]
>       at 
> org.apache.cassandra.db.lifecycle.LogAwareFileLister.innerList(LogAwareFileLister.java:75)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.lifecycle.LogAwareFileLister.list(LogAwareFileLister.java:53)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       ... 20 common frames omitted
> Caused by: java.nio.file.FileSystemException: 
> /data/cassandra/data/keyschema/columnfamily-3996ce80b7ac11e48a9b6776bf484396/ma_txn_anticompactionafterrepair_f20b50d0-22bd-11e6-970f-6f22464f4624.log:
>  Too many open files
>       at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) 
> ~[na:1.8.0_91]
>       at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) 
> ~[na:1.8.0_91]
>       at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) 
> ~[na:1.8.0_91]
>       at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
>  ~[na:1.8.0_91]
>       at java.nio.file.Files.newByteChannel(Files.java:361) ~[na:1.8.0_91]
>       at java.nio.file.Files.newByteChannel(Files.java:407) ~[na:1.8.0_91]
>       at 
> java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
>  ~[na:1.8.0_91]
>       at java.nio.file.Files.newInputStream(Files.java:152) ~[na:1.8.0_91]
>       at java.nio.file.Files.newBufferedReader(Files.java:2784) ~[na:1.8.0_91]
>       at java.nio.file.Files.readAllLines(Files.java:3202) ~[na:1.8.0_91]
>       at org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:615) 
> ~[apache-cassandra-3.0.6.jar:3.0.6]
>       ... 45 common frames omitted
> WARN  [HintsWriteExecutor:1] 2016-05-25 21:54:28,389 CLibrary.java:256 - 
> open(/var/lib/cassandra/hints, O_RDONLY) failed, errno (24).
> ERROR [CompactionExecutor:2660] 2016-05-25 21:54:29,915 
> JVMStabilityInspector.java:139 - JVM state determined to be unstable.  
> Exiting forcefully due to:
> java.io.FileNotFoundException: 
> /data/cassandra/data/keyschema/columnfamily-3996ce80b7ac11e48a9b6776bf484396/ma-92661-big-Filter.db
>  (Too many open files)
>       at java.io.FileOutputStream.open0(Native Method) ~[na:1.8.0_91]
>       at java.io.FileOutputStream.open(FileOutputStream.java:270) 
> ~[na:1.8.0_91]
>       at java.io.FileOutputStream.<init>(FileOutputStream.java:213) 
> ~[na:1.8.0_91]
>       at java.io.FileOutputStream.<init>(FileOutputStream.java:101) 
> ~[na:1.8.0_91]
>       at 
> org.apache.cassandra.io.sstable.format.big.BigTableWriter$IndexWriter.flushBf(BigTableWriter.java:431)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.io.sstable.format.big.BigTableWriter$IndexWriter.doPrepare(BigTableWriter.java:461)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.io.sstable.format.big.BigTableWriter$TransactionalProxy.doPrepare(BigTableWriter.java:303)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.io.sstable.format.SSTableWriter.prepareToCommit(SSTableWriter.java:229)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:357)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.compaction.CompactionManager.antiCompactGroup(CompactionManager.java:1303)
>  [apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.compaction.CompactionManager.doAntiCompaction(CompactionManager.java:1232)
>  [apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.compaction.CompactionManager.performAnticompaction(CompactionManager.java:569)
>  [apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:479)
>  [apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> [apache-cassandra-3.0.6.jar:3.0.6]
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [na:1.8.0_91]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [na:1.8.0_91]
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [na:1.8.0_91]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [na:1.8.0_91]
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_91]
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_91]
>       at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> {code}
> It seems that by using the -seq option, C* just opened way too 
> many files. The limits currently look fine though, as shown below:
> {code}
> root@node3:/# cat /etc/security/limits.d/cassandra.conf 
> # Provided by the cassandra package
> cassandra  -  memlock  unlimited
> cassandra  -  nofile   100000
> cassandra  -  as       unlimited
> cassandra  -  nproc    8096
> {code}
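> Note that limits.d settings only apply to sessions going through PAM, so
> depending on how the daemon is started (init script, systemd unit) the
> running process may end up with a different limit than the one configured.
> The value the kernel actually enforces can be read from /proc (the pgrep
> pattern below is just an example):
> {code}
> grep 'Max open files' /proc/$(pgrep -f CassandraDaemon)/limits
> {code}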
> Today I checked the number of files opened by C* (admittedly while running 
> another repair without the -seq option):
> {code}
> pssh --inline-stdout -h cassandra-hosts.txt -l root -t 0 "lsof -n | grep java 
> | grep cassandra | wc -l"
> [1] 14:29:43 [SUCCESS] node3 (where the repair is taking place)
> 25452
> [2] 14:29:43 [SUCCESS] node4
> 46779
> pssh --inline-stdout -h cassandra-hosts.txt -l root -t 0 "lsof -n | grep java 
> | grep cassandra | grep 'REG' | wc -l"
> [1] 14:26:40 [SUCCESS] node3 (where the repair is taking place)
> 24020
> [2] 14:26:41 [SUCCESS] node4
> 4050
> {code}
> It looks like during a repair, lots of file descriptors on the repairing
> node relate to regular files, while on all the other nodes the vast
> majority relate to memory-mapped files.
> Nevertheless, 25K is still way less than 100K (still high though).
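> One caveat with these numbers: lsof also lists memory-mapped files and
> cwd/txt entries, and those mmap entries do not count against the nofile
> limit, so the totals above can overstate the real fd usage. Counting the
> entries in /proc/<pid>/fd gives the figure the kernel checks against the
> limit (the pid lookup is just an example):
> {code}
> ls /proc/$(pgrep -f CassandraDaemon)/fd | wc -l
> {code}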
> (To be honest, when I started the repair, nodetool logged it as "Starting
> repair command #2". I didn't pay much attention to that, since the previous
> repair had already returned quite some time before, but it might be that
> another repair was still lingering, which would be weird.)
> In conclusion, is this behavior normal? Should I increase the limits, or is 
> anticompaction really playing badly with incremental repairs and LCS tables 
> (which are, to be honest, quite overloaded)?


