Sounds like a legitimate bug, although looking through the code I'm not sure what would cause a tight retry loop on migration announce/rectify. Can you create a ticket at https://issues.apache.org/jira/browse/CASSANDRA ?
As a workaround, I would try manually copying the Migrations and Schema sstable files from the system keyspace of the live node, then restarting the recovering one.

On Thu, May 26, 2011 at 9:27 AM, Flavio Baronti <[email protected]> wrote:
> I can't seem to be able to recover a failed node on a database where I did
> many updates to the schema.
>
> I have a small cluster with 2 nodes, around 1000 CF (I know it's a lot, but
> it can't be changed right now), and ReplicationFactor=2.
> I shut down a node and cleaned its data entirely, then tried to bring it
> back up. The node starts fetching schema updates from the live node, but the
> operation fails halfway with an OOME.
> After some investigation, what I found is that:
>
> - I have a lot of schema updates (there are 2067 rows in the system.Schema
> CF).
> - The live node loads migrations 1-1000, and sends them to the recovering
> node (Migration.getLocalMigrations())
> - Soon afterwards, the live node checks the schema version on the recovering
> node and finds it has moved by a little - say it has applied the first 3
> migrations. It then loads migrations 3-1003, and sends them to the node.
> - This process is repeated very quickly (sends migrations 6-1006, 9-1009,
> etc).
>
> Analyzing the memory dump and the logs, it looks like each of these blocks
> of 1000 migrations is composed into a single message and sent to the
> OutboundTcpConnection queue. However, since the schema is big, the messages
> occupy a lot of space, and are built faster than the connection can send
> them. Therefore, they accumulate in OutboundTcpConnection.queue, until
> memory is completely filled.
>
> Any suggestions? Can I change something to make this work, apart from
> reducing the number of CFs?
>
> Flavio
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
