We've applied a fix to the 0.7 branch in https://issues.apache.org/jira/browse/CASSANDRA-2714. The patch probably applies to 0.7.6 as well.
On Thu, May 26, 2011 at 11:36 AM, Flavio Baronti <f.baro...@list-group.com> wrote:
> I tried the manual copy you suggest, but the SystemTable.checkHealth() function
> complains it can't load the system files. Log follows, I will gather some more
> info and create a ticket as soon as possible.
>
> INFO [main] 2011-05-26 18:25:36,147 AbstractCassandraDaemon.java Logging initialized
> INFO [main] 2011-05-26 18:25:36,172 AbstractCassandraDaemon.java Heap size: 4277534720/4277534720
> INFO [main] 2011-05-26 18:25:36,174 CLibrary.java JNA not found. Native methods will be disabled.
> INFO [main] 2011-05-26 18:25:36,190 DatabaseDescriptor.java Loading settings from file:/C:/Cassandra/conf/hscassandra9170.yaml
> INFO [main] 2011-05-26 18:25:36,344 DatabaseDescriptor.java DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
> INFO [main] 2011-05-26 18:25:36,532 SSTableReader.java Opening G:\Cassandra\data\system\Schema-f-2746
> INFO [main] 2011-05-26 18:25:36,577 SSTableReader.java Opening G:\Cassandra\data\system\Schema-f-2729
> INFO [main] 2011-05-26 18:25:36,590 SSTableReader.java Opening G:\Cassandra\data\system\Schema-f-2745
> INFO [main] 2011-05-26 18:25:36,599 SSTableReader.java Opening G:\Cassandra\data\system\Migrations-f-2167
> INFO [main] 2011-05-26 18:25:36,600 SSTableReader.java Opening G:\Cassandra\data\system\Migrations-f-2131
> INFO [main] 2011-05-26 18:25:36,602 SSTableReader.java Opening G:\Cassandra\data\system\Migrations-f-1041
> INFO [main] 2011-05-26 18:25:36,603 SSTableReader.java Opening G:\Cassandra\data\system\Migrations-f-1695
> ERROR [main] 2011-05-26 18:25:36,634 AbstractCassandraDaemon.java Fatal exception during initialization
> org.apache.cassandra.config.ConfigurationException: Found system table files, but they couldn't be loaded. Did you change the partitioner?
>         at org.apache.cassandra.db.SystemTable.checkHealth(SystemTable.java:236)
>         at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:127)
>         at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
>         at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)
>
>
> On 5/26/2011 6:04 PM, Jonathan Ellis wrote:
>>
>> Sounds like a legitimate bug, although looking through the code I'm
>> not sure what would cause a tight retry loop on migration
>> announce/rectify. Can you create a ticket at
>> https://issues.apache.org/jira/browse/CASSANDRA ?
>>
>> As a workaround, I would try manually copying the Migrations and
>> Schema sstable files from the system keyspace of the live node, then
>> restart the recovering one.
>>
>> On Thu, May 26, 2011 at 9:27 AM, Flavio Baronti
>> <f.baro...@list-group.com> wrote:
>>>
>>> I can't seem to be able to recover a failed node on a database where I
>>> did many updates to the schema.
>>>
>>> I have a small cluster with 2 nodes, around 1000 CF (I know it's a lot,
>>> but it can't be changed right now), and ReplicationFactor=2.
>>> I shut down a node and cleaned its data entirely, then tried to bring it
>>> back up. The node starts fetching schema updates from the live node, but
>>> the operation fails halfway with an OOME.
>>> After some investigation, what I found is that:
>>>
>>> - I have a lot of schema updates (there are 2067 rows in the
>>>   system.Schema CF).
>>> - The live node loads migrations 1-1000, and sends them to the recovering
>>>   node (Migration.getLocalMigrations())
>>> - Soon afterwards, the live node checks the schema version on the
>>>   recovering node and finds it has moved by a little - say it has applied
>>>   the first 3 migrations. It then loads migrations 3-1003, and sends them
>>>   to the node.
>>> - This process is repeated very quickly (sends migrations 6-1006, 9-1009,
>>>   etc).
>>>
>>> Analyzing the memory dump and the logs, it looks like each of these
>>> 1000-migration blocks is composed into a single message and sent to the
>>> OutboundTcpConnection queue. However, since the schema is big, the
>>> messages occupy a lot of space, and are built faster than the connection
>>> can send them. Therefore, they accumulate in OutboundTcpConnection.queue,
>>> until memory is completely filled.
>>>
>>> Any suggestions? Can I change something to make this work, apart from
>>> reducing the number of CFs?
>>>
>>> Flavio
>>>
>>
>>
>>
>
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
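
For readers following the thread, the retry loop Flavio describes in the quoted message can be illustrated with a small toy model. This is only a sketch in Python, not Cassandra's code: the function name simulate and the parameters applied_per_check and checks_per_send are hypothetical, and the rates are made-up assumptions. Only the block size of 1000 and the ranges 1-1000, 3-1003, 6-1006, 9-1009 come from the report. It shows how overlapping blocks pile up in a queue when they are built faster than they are sent.

```python
from collections import deque


def simulate(checks, block_size=1000, applied_per_check=3, checks_per_send=5):
    """Toy model of the reported loop (all rates are assumptions).

    Returns (ranges_enqueued, pending_messages) after `checks` version checks.
    """
    queue = deque()   # stands in for OutboundTcpConnection.queue
    ranges = []       # every (first, last) migration block that was built
    applied = 0       # migrations the recovering node has applied so far
    for check in range(1, checks + 1):
        # The live node sees a stale schema version and builds a new block
        # starting from the recovering node's current position.
        block = (max(1, applied), applied + block_size)
        ranges.append(block)
        queue.append(block)
        # Assumption: the connection only manages to send one message
        # every `checks_per_send` checks.
        if check % checks_per_send == 0:
            queue.popleft()
        # Assumption: the recovering node advances only slightly per check.
        applied += applied_per_check
    return ranges, len(queue)


ranges, pending = simulate(checks=10)
print(ranges[:4])   # first few blocks, overlapping almost entirely
print(pending)      # messages still waiting in the queue
```

With these assumed rates the number of pending messages grows linearly with the number of checks, and each pending message carries close to 1000 migrations, which is consistent with the memory exhaustion described in the report.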