[ https://issues.apache.org/jira/browse/CASSANDRA-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aleksey Yeschenko updated CASSANDRA-8052:
-----------------------------------------
    Fix Version/s: 3.x

> OOMs from allocating large arrays when deserializing (e.g. probably corrupted EstimatedHistogram data)
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8052
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8052
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: linux
>            Reporter: Matt Byrd
>              Labels: OOM, checksum, corruption, oom, serialization
>             Fix For: 3.x
>
> We've seen nodes with what are presumably corrupted sstables repeatedly OOM on attempted startup with a message such as:
> {code}
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:266)
>     at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:292)
>     at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:282)
>     at org.apache.cassandra.io.sstable.SSTableReader.openMetadata(SSTableReader.java:234)
>     at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:194)
>     at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>     at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
> {code}
> It's probably not a coincidence that the exception is thrown here, since this appears to be the first read from the file.
> Presumably the correct operational response is simply to replace the node; however, I was wondering whether, more generally, we should validate lengths when we deserialise things.
> That would avoid allocating large byte buffers that cause unpredictable OOMs, and would instead throw an exception that can be handled as appropriate.
> In this particular instance, there is no need for an unduly large size for the estimated histogram.
> Admittedly things are slightly different in 2.1, though I suspect a similar thing could happen with:
> {code}
> int numComponents = in.readInt();
> // read toc
> Map<MetadataType, Integer> toc = new HashMap<>(numComponents);
> {code}
> A find-usages of DataInputStream.readInt() reveals quite a few places where an int is read in and then an ArrayList, array or map of that size is created.
> In some cases the size might legitimately range over the whole of a Java int, or the read might sit in a performance-critical or delicate piece of code where one doesn't want such checks.
> There are also other checksums and mechanisms at play which make some input less likely to be corrupted.
> However, it may be worth a pass over instances of this type of input, to add validation where it makes sense.
> Perhaps there are less likely but worse failure modes present and hidden, e.g. if the deserialisation happens to be for a message sent to some or all nodes.
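As an illustration of the kind of length validation suggested above, here is a minimal Java sketch. The helper name {{readValidatedLength}} and the class {{SafeLengths}} are hypothetical, not existing Cassandra API; the idea is simply to reject an implausible size with a checked exception before any allocation happens.

{code}
import java.io.DataInput;
import java.io.IOException;

public final class SafeLengths
{
    private SafeLengths() {}

    /**
     * Reads an int that is about to be used as an array/collection size and
     * rejects values outside a caller-supplied plausible range, so that
     * corrupted input fails with an IOException rather than an OutOfMemoryError.
     */
    public static int readValidatedLength(DataInput in, int maxExpected, String what) throws IOException
    {
        int length = in.readInt();
        if (length < 0 || length > maxExpected)
            throw new IOException(String.format("Read invalid %s length %d (expected 0..%d); data is likely corrupt",
                                                what, length, maxExpected));
        return length;
    }
}
{code}

A deserializer such as EstimatedHistogramSerializer could then guard its allocation roughly like this (the bound of 100000 is an assumed sanity limit, since histograms normally have on the order of a hundred buckets):

{code}
// e.g. instead of allocating new long[in.readInt()] directly:
int size = SafeLengths.readValidatedLength(in, 100000, "EstimatedHistogram bucket count");
long[] offsets = new long[size];
{code}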