Hi All,

This is the first time I have seen this. I am using Hector for a bulk load into a 3-node Cassandra 0.7.0 cluster. I have been doing this for a while now, but this time the load was more intense than the previous runs, and it was running from a single client machine because I was afraid of overloading the cluster.
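For context, the insert path is essentially the standard Hector mutator pattern, along the lines of the sketch below. This is a simplified illustration rather than my exact code: the cluster, keyspace, and host names are placeholders, and the Thrift socket timeout line is the client-side setting that I believe would make a call fail instead of hanging.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class BulkInsertSketch {
    public static void main(String[] args) {
        // Placeholder host list; 9160 is the default Thrift port in 0.7.
        CassandraHostConfigurator hosts =
                new CassandraHostConfigurator("node1:9160,node2:9160,node3:9160");
        // Client-side Thrift socket timeout in ms, so a call errors out
        // instead of blocking forever if a node stops responding.
        hosts.setCassandraThriftSocketTimeout(10000);

        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", hosts);   // placeholder name
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);  // placeholder name

        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        // Queue up a batch of columns, then send them in one batch mutation.
        for (int i = 0; i < 1000; i++) {
            mutator.addInsertion("row-" + i, "Domains",
                    HFactory.createStringColumn("col-" + i, "value-" + i));
        }
        mutator.execute();
    }
}

As far as I understand, the default Thrift socket timeout is unbounded, so without a setting like this a hung node can leave the client blocked inside execute() indefinitely, which would match what I saw.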
My client blocked on a batch_insert for a very long time; the call never returned. This happened during a bulk load at night. The client got auto-killed and restarted 20 times, and every time it blocked again. After seeing this in the morning, many hours later, I ran a new Java stress test and saw this:

./stress -o INSERT -e ONE -c 30 -n 100000 -l 1 -i 1 -t 1
Created keyspaces. Sleeping 1s for propagation.
total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
0,0,0,NaN,1
0,0,0,NaN,2
0,0,0,NaN,3
0,0,0,NaN,4
0,0,0,NaN,5
0,0,0,NaN,6

Then I used the cli to insert a single column into CF Standard1, which had been created by the test above. After this, the stress test was able to run.

Here is what the log had in the morning (the times are European). This is likely related to the stress test; I am including it for completeness.

INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line 648) switching in a fresh Memtable for Migrations at CommitLogContext(file='/data/social2/commitlog/CommitLog-1296861890465.log', position=6798985)
INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line 952) Enqueuing flush of Memtable-Migrations@24152491(9724 bytes, 1 operations)
INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line 648) switching in a fresh Memtable for Schema at CommitLogContext(file='/data/social2/commitlog/CommitLog-1296861890465.log', position=6798985)
INFO [FlushWriter:1] 2011-02-05 14:29:52,689 Memtable.java (line 155) Writing Memtable-Migrations@24152491(9724 bytes, 1 operations)
INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line 952) Enqueuing flush of Memtable-Schema@22965127(5481 bytes, 5 operations)
INFO [FlushWriter:1] 2011-02-05 14:29:52,955 Memtable.java (line 162) Completed flushing /data/social2/data/system/Migrations-e-15-Data.db (9857 bytes)
INFO [FlushWriter:1] 2011-02-05 14:29:52,955 Memtable.java (line 155) Writing Memtable-Schema@22965127(5481 bytes, 5 operations)
ERROR [MutationStage:12] 2011-02-05 14:29:53,014 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1009

================

Here is a summary of some unusual activity on one of the nodes before the first blockage. This node is different from the other nodes: it is slower according to my stress tests, due to non-Cassandra background and disk activity.

- the node had an OOM 10 minutes prior to the first blockage
- the node restarted normally
- it replayed commit logs
- it lost contact with the 2 other nodes for a few milliseconds
- it flushed one memtable with 640K operations
- it then filled the same memtable with another 640K columns in 9 seconds, which seems impossible (that would be roughly 70,000 columns/s on this one node) - maybe I am misreading the log below?

INFO [MutationStage:4] 2011-02-05 00:18:14,437 ColumnFamilyStore.java (line 952) Enqueuing flush of Memtable-Domains@29765464(14150991 bytes, 640000 operations)
INFO [MutationStage:4] 2011-02-05 00:18:23,743 ColumnFamilyStore.java (line 952) Enqueuing flush of Memtable-Domains@1573293(13948680 bytes, 640000 operations)

- it compacted 4 SSTables for this CF
- it compacted 5 SSTables for the same CF - very unusual
- it reported a large row; this CF does have a large row, but it should not have been touched in this load
- while the above was happening, CMS was reclaiming 1.2G several times
- the node lost contact with another node that seemed to be under some pressure, with a high heap and no GC to speak of
- the CF in question then received only 10K columns in 4 minutes
- the last activity was a hinted handoff to the stressed node once it reappeared

After the blockage started, the only activity I see is all CFs being flushed due to time and all caches being saved. There was no other activity until the stress test in the morning.

Thank you,
Oleg