[ https://issues.apache.org/jira/browse/CASSANDRA-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494746#comment-14494746 ]
Ariel Weisberg commented on CASSANDRA-9120:
-------------------------------------------

I am on board with the patch for 2.0 and 2.1, but it seems like it could be a little more accurate at calculating element size (I think it undercounts by a large amount). If you instantiate a toy minimal-size IndexInfo and invoke memorySize(), you should get a more accurate bound and can put that into a static final field.

You can cache the result of maxMemory() in a static final field. Apparently it is implemented as a JNI call (unless it is also an intrinsic, which I wouldn't expect).

The last thing to think about is estimateMemorySizeForKeys(int). It's a lot of multiplication and division, and some of the constants might not be fully evaluated if they are tucked into a MemoryLayoutSpecification.

I don't want to get carried away trying to make it better, but I am uncomfortable because we don't have a suite of benchmarks that will run to let us know if this change has an impact. If you have a simple idea for how to make this approximation simpler to calculate, that would be great.

> OutOfMemoryError when read auto-saved cache (probably broken)
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-9120
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9120
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Linux
>            Reporter: Vladimir
>             Fix For: 3.0, 2.0.15, 2.1.5
>
>
> Found during tests on a 100-node cluster. After a restart I found that one
> node constantly crashes with an OutOfMemory exception. I guess that the auto-saved
> cache was corrupted and Cassandra can't recognize it. I see that similar
> issues were already fixed (when a negative size of some structure was read).
> Does the auto-saved cache have a checksum? It would help to reject a corrupted cache at
> the very beginning.
> As far as I can see, the current code still has that problem.
> Stack trace is:
> {code}
> INFO [main] 2015-03-28 01:04:13,503 AutoSavingCache.java (line 114) reading saved cache /storage/core/loginsight/cidata/cassandra/saved_caches/system-sstable_activity-KeyCache-b.db
> ERROR [main] 2015-03-28 01:04:14,718 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.ArrayList.<init>(Unknown Source)
>         at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:120)
>         at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:365)
>         at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
>         at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:262)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:421)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:392)
>         at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:315)
>         at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:272)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:114)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:92)
>         at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:261)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
> {code}
> I looked at the Cassandra source code and see:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.cassandra/cassandra-all/2.0.10/org/apache/cassandra/db/RowIndexEntry.java
> 119 int entries = in.readInt();
> 120 List<IndexHelper.IndexInfo> columnsIndex = new ArrayList<IndexHelper.IndexInfo>(entries);
> It seems that the value of entries is invalid (negative) and it tries to allocate
> an array with huge
> initial capacity and hits OOM. I have deleted the saved_caches
> directory and was able to start the node correctly. We should expect that this may
> happen in the real world. Cassandra should be able to skip incorrect cached data
> and run.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
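
The failure mode in the report (a count read from the saved cache and passed straight to the ArrayList constructor) can be guarded against by validating the count before allocating. Note that new ArrayList(int) throws IllegalArgumentException for a negative capacity, so an OutOfMemoryError at that line suggests a very large positive value rather than a negative one. The following is only a minimal sketch of the idea, not the patch attached to the ticket: the class BoundedCountReader, the checkCount helper, the constant MIN_ENTRY_HEAP_BYTES (set to an arbitrary 32 bytes) and the toy "int count followed by longs" format are all hypothetical. The only real API it leans on is Runtime.maxMemory(), cached in a static final field as suggested in the comment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names exist in Cassandra.
final class BoundedCountReader {
    // Looked up once at class load instead of on every deserialization.
    private static final long MAX_HEAP_BYTES = Runtime.getRuntime().maxMemory();

    // Assumed lower bound on the heap cost of one decoded entry, in bytes.
    // A real value would come from measuring a minimal entry once.
    static final long MIN_ENTRY_HEAP_BYTES = 32;

    /** Rejects counts that are negative or cannot fit in the given heap budget. */
    static void checkCount(int entries, long heapBytes) throws IOException {
        if (entries < 0 || entries > heapBytes / MIN_ENTRY_HEAP_BYTES) {
            throw new IOException("Implausible entry count in saved cache: " + entries);
        }
    }

    /** Toy format: an int count followed by that many longs. */
    static List<Long> readEntries(DataInput in) throws IOException {
        int entries = in.readInt();
        checkCount(entries, MAX_HEAP_BYTES); // validate before allocating
        List<Long> result = new ArrayList<>(entries);
        for (int i = 0; i < entries; i++) {
            result.add(in.readLong());
        }
        return result;
    }

    /** Builds an in-memory stream in the toy format, for demonstration only. */
    static DataInput streamOf(int count, long... values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(count);
        for (long v : values) {
            out.writeLong(v);
        }
        return new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readEntries(streamOf(2, 10L, 20L)));
        try {
            readEntries(streamOf(-1)); // simulated corrupt header
        } catch (IOException e) {
            // A caller could discard the saved cache here and start cold.
            System.out.println("skipping cache: " + e.getMessage());
        }
    }
}
```

A heap-based bound like this only rejects counts that could never fit; a corrupt but moderately large count would still pass, which is why a checksum on the saved cache, as the reporter asks about, would be the stronger check.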