I have similar problem with cassandra 0.8.10. After digging a while I've found that my problem is somehow related to system page scanning activities. After turning on -Xloggc I've found high system cpu usage during ParNew. After looking into sar -B I've found this:
08:00:08 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff 08:00:08 AM 3916.08 962.47 1737.93 31.00 1556.87 0.00 6514.35 789.03 12.11 08:10:06 AM 4753.89 1654.53 502.34 37.57 1296.41 0.00 7074.27 942.17 13.32 08:20:09 AM 4961.67 1062.75 508.87 40.01 1194.93 0.00 7474.82 915.63 12.25 08:30:03 AM 5006.94 1397.18 500.92 39.95 1394.10 0.00 9155.64 1124.52 12.28 08:40:09 AM 5796.97 3312.25 507.73 40.44 1462.88 0.00 59539.87 1173.68 1.97 08:50:08 AM 2354.89 602.84 281.43 19.32 636.23 0.00 2126568.76 331.31 0.02 09:00:08 AM 4.83 73.72 90.16 0.06 61.83 0.00 3626444.63 10.90 0.00 09:10:08 AM 0.87 2.82 81.70 0.01 45.39 0.00 3563230.69 0.03 0.00 09:20:09 AM 0.01 4.12 82.28 0.00 46.45 0.00 3566248.02 0.02 0.00 09:30:07 AM 8.62 3.84 122.97 0.11 65.31 0.00 3569377.35 0.59 0.00 09:40:08 AM 0.49 154.61 84.49 0.01 53.06 0.00 3580618.13 4.20 0.00 09:50:08 AM 0.07 5.25 82.58 0.00 47.76 0.00 3564969.22 0.04 0.00 10:00:08 AM 38.25 205.50 200.00 0.02 161.27 0.00 3580132.24 0.25 0.00 10:10:08 AM 0.59 3.43 97.67 0.02 52.99 0.00 3562920.83 0.04 0.00 10:20:08 AM 1.61 5.74 83.32 0.03 48.05 0.00 3587643.92 0.07 0.00 10:30:08 AM 0.47 3.67 77.19 0.00 44.04 0.00 3568829.92 0.11 0.00 10:40:08 AM 0.26 34.01 77.61 0.01 65.30 0.00 3564740.06 20.00 0.00 10:50:06 AM 0.25 4.76 82.60 0.00 49.48 0.00 3562681.08 2.56 0.00 11:00:08 AM 0.90 3.18 85.22 0.02 46.19 0.00 3559971.42 0.26 0.00 11:10:08 AM 0.05 3.20 81.81 0.00 45.50 0.00 3571435.22 0.03 0.00 11:20:07 AM 0.06 5.09 82.39 0.00 46.57 0.00 3571160.15 0.05 0.00 11:30:08 AM 0.42 3.38 76.75 0.00 43.54 0.00 3550153.01 0.07 0.00 11:40:08 AM 2.25 34.13 80.73 0.02 48.81 0.00 3556773.99 0.16 0.00 11:50:08 AM 143.70 145.63 181.79 0.00 146.17 0.00 3559806.31 0.04 0.00 12:00:08 PM 8.93 4.34 1428.64 0.14 551.37 0.00 3571540.52 0.82 0.00 12:10:08 PM 29038.30 32157.76 2301.36 163.57 17789.40 12155.79 1555422.56 14316.18 0.91 12:20:08 PM 40005.26 33453.59 1175.56 119.07 19402.11 14067.52 27828.66 18016.59 43.00 12:30:08 PM 28625.65 19189.01 1197.00 124.72 12740.11 7576.56 11155.77 11423.98 60.99 12:40:08 PM 14569.77 8156.18 1197.94 115.99 11249.82 1115.44 1493.11 1704.10 65.33 12:50:09 PM 12663.07 12462.00 1159.15 91.80 2798.45 0.00 3701.85 1508.08 40.74 01:00:05 PM 12587.33 10603.46 1117.98 100.40 6279.41 3533.36 3904.89 5156.94 69.33 It's looks like system is busy for 3 hours looking for pages to free... There was no heavy compaction, unexpected high load - nothing special but strange kernel behaviour, long ParNew and message drops. Does your system behave that way? Regards, Wojtek On Tue, Jul 17, 2012 at 10:14 AM, Nikolay Kоvshov <nkovs...@yandex.ru> wrote: > > This is a cluster of 2 nodes, each having 8G of operating memory, > replicationfactor=2 > Write/read pressure is quite low and almost never exceeds 10/second > > From time to time (2-3 times in a month) I see GC activity in logs and for > this time cassandra stops responding to requests which results in a timeout > in upper-layer application. Total time of unavailability can be over 5 minues > (like in the following case) > > What can I do with that? Wiil it become much worse when my cluster grows up? > > INFO [GossipTasks:1] 2012-07-16 13:10:50,055 Gossiper.java (line 736) > InetAddress /10.220.50.9 is now dead. > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,056 GCInspector.java (line 123) > GC for ParNew: 391383 ms for 1 collections, 2025808488 used; max is 8464105472 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,067 StatusLogger.java (line 50) > Pool Name Active Pending Blocked > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,071 StatusLogger.java (line 65) > ReadStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,071 StatusLogger.java (line 65) > RequestResponseStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,072 StatusLogger.java (line 65) > ReadRepairStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,072 StatusLogger.java (line 65) > MutationStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,073 StatusLogger.java (line 65) > ReplicateOnWriteStage 0 0 0 > INFO [GossipStage:1] 2012-07-16 13:10:50,074 Gossiper.java (line 722) > InetAddress /10.220.50.9 is now UP > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,075 StatusLogger.java (line 65) > GossipStage 1 59 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,075 StatusLogger.java (line 65) > AntiEntropyStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,076 StatusLogger.java (line 65) > MigrationStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,076 StatusLogger.java (line 65) > StreamStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,077 StatusLogger.java (line 65) > MemtablePostFlusher 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,077 StatusLogger.java (line 65) > FlushWriter 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,077 StatusLogger.java (line 65) > MiscStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,078 StatusLogger.java (line 65) > InternalResponseStage 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,078 StatusLogger.java (line 65) > HintedHandoff 0 0 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,079 StatusLogger.java (line 69) > CompactionManager n/a 0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,079 StatusLogger.java (line 81) > MessagingService n/a 0,0 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,080 StatusLogger.java (line 85) > ColumnFamily Memtable ops,data Row cache size/cap Key cache > size/cap > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,080 StatusLogger.java (line 88) > Keyspace1.PSS 36712,343842617 0/0 > 97995/1000000 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,085 StatusLogger.java (line 88) > Keyspace1.Standard1 128679,162567721 0/0 > 0/1000000 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,085 StatusLogger.java (line 88) > system.NodeIdInfo 0,0 0/0 > 0/1 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,086 StatusLogger.java (line 88) > system.IndexInfo 0,0 0/0 > 0/1 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,086 StatusLogger.java (line 88) > system.LocationInfo 0,0 0/0 > 1/2 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,086 StatusLogger.java (line 88) > system.Migrations 0,0 0/0 > 3/3 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,087 StatusLogger.java (line 88) > system.HintsColumnFamily 1,528 0/0 > 0/1 > INFO [ScheduledTasks:1] 2012-07-16 13:10:50,087 StatusLogger.java (line 88) > system.Schema 0,0 0/0 > 2/3 > -------- Завершение пересылаемого сообщения --------