[ 
https://issues.apache.org/jira/browse/CASSANDRA-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518160#comment-14518160
 ] 

Ariel Weisberg commented on CASSANDRA-9120:
-------------------------------------------

[~kuzminva] do you think silently skipping the cache is something we should do, 
or should we require operators to delete it themselves?

I feel like trying to guess what heap size is "correct" can result in skipping 
the cache when the operator would rather we didn't. That would make shrinking 
the heap synonymous with not loading the cache.

I feel like the best thing we can do without a checksum is provide good error 
messages that point the operator to the next step. If cache loading fails, I 
think we should log a catch-all error message that describes the possible 
causes (a corrupted cache, or a change in heap size) and points out that the 
operator can move or remove the caches.
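
As a rough illustration of the catch-all message idea, here is a minimal sketch. 
The class and method names (CacheLoadGuard, loadOrExplain) and the message 
wording are hypothetical and are not taken from the Cassandra code base; the 
loader is passed in as a plain Callable so the example stands alone.

```java
import java.nio.file.Path;
import java.util.concurrent.Callable;

// Hypothetical helper, not Cassandra's actual API.
public class CacheLoadGuard {

    // Builds one catch-all message that names the likely causes and the next step.
    static String failureMessage(Path cacheFile, Throwable cause) {
        return "Failed to load saved cache " + cacheFile + " (" + cause + "). "
             + "The file may be corrupted, or the heap may be smaller than when "
             + "the cache was written. To start without it, move or remove the file.";
    }

    // Runs the loader and returns the number of entries it reports.
    // On failure, rethrows with the catch-all message instead of a bare error.
    static int loadOrExplain(Path cacheFile, Callable<Integer> loader) {
        try {
            return loader.call();
        } catch (Exception | OutOfMemoryError e) {
            // OutOfMemoryError is caught explicitly because a corrupted size
            // field can trigger a huge allocation during deserialization.
            throw new IllegalStateException(failureMessage(cacheFile, e), e);
        }
    }
}
```

This variant still fails startup, so the operator decides what to do with the 
file; whether to skip the cache automatically instead is the open question 
above.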

> OutOfMemoryError when read auto-saved cache (probably broken)
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-9120
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9120
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Linux
>            Reporter: Vladimir
>             Fix For: 3.0, 2.0.15, 2.1.5
>
>
> Found during tests on a 100-node cluster. After a restart I found that one 
> node constantly crashes with an OutOfMemoryError. I guess that the auto-saved 
> cache was corrupted and Cassandra can't recognize it. I see that similar 
> issues were already fixed (when a negative size of some structure was read). 
> Does the auto-saved cache have a checksum? It would help to reject a 
> corrupted cache at the very beginning.
> As far as I can see, the current code still has that problem. The stack trace is:
> {code}
> INFO [main] 2015-03-28 01:04:13,503 AutoSavingCache.java (line 114) reading saved cache /storage/core/loginsight/cidata/cassandra/saved_caches/system-sstable_activity-KeyCache-b.db
> ERROR [main] 2015-03-28 01:04:14,718 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.ArrayList.<init>(Unknown Source)
>         at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:120)
>         at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:365)
>         at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
>         at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:262)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:421)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:392)
>         at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:315)
>         at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:272)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:114)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:92)
>         at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:261)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
> {code}
> I looked at the Cassandra source code and see:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.cassandra/cassandra-all/2.0.10/org/apache/cassandra/db/RowIndexEntry.java
> 119 int entries = in.readInt();
> 120 List<IndexHelper.IndexInfo> columnsIndex = new ArrayList<IndexHelper.IndexInfo>(entries);
> It seems that the value of entries is invalid, so it tries to allocate an 
> array with a huge initial capacity and hits OOM. (A negative value would 
> raise IllegalArgumentException instead, so the corrupted value is probably a 
> very large positive number.) I deleted the saved_caches directory and was 
> able to start the node correctly. We should expect that this may happen in 
> the real world. Cassandra should be able to skip incorrect cached data and 
> run.
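
One way to avoid the huge allocation described in the report, even without a 
checksum, is to stop trusting the serialized count when sizing the list. The 
sketch below is an assumption-laden illustration, not the actual fix: the class 
name, the cap constant, and the use of plain long values in place of 
IndexHelper.IndexInfo are all hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the deserializer in the report.
public class BoundedCountRead {

    // Hypothetical cap on the initial capacity; the list can still grow past it.
    static final int MAX_PREALLOCATED_ENTRIES = 1024;

    static List<Long> readEntries(DataInputStream in) throws IOException {
        int entries = in.readInt();
        if (entries < 0) {
            throw new IOException("Corrupted cache: negative entry count " + entries);
        }
        // Cap the initial capacity so a garbage count cannot exhaust the heap up front.
        List<Long> result = new ArrayList<>(Math.min(entries, MAX_PREALLOCATED_ENTRIES));
        for (int i = 0; i < entries; i++) {
            // readLong throws EOFException if the stream ends before the claimed count.
            result.add(in.readLong());
        }
        return result;
    }
}
```

With this shape, a corrupted count on a short file surfaces as an IOException 
that the caller can handle by skipping the cache, rather than as an 
OutOfMemoryError during startup. A garbage count followed by a long run of 
readable bytes could still grow the list, so this narrows the problem rather 
than replacing a checksum.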



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)