Filed as https://issues.apache.org/jira/browse/CASSANDRA-2037

I can't see how the code would be correct as written but I'm usually
wrong about most things.

On Sun, Jan 23, 2011 at 12:14 PM, Erik Onnen <eon...@gmail.com> wrote:
> One of the developers will have to confirm but this looks like a bug
> to me. MessagingService is a singleton and there's a Multimap used for
> targets that isn't accessed in a thread safe manner.
>
> The thread dump would seem to confirm this, when you hammer what is
> ultimately a standard HashMap with multiple readers while a write
> happens, there's a race that can be triggered where all subsequent
> reads on that map will get stuck.
>
> Your thread dump would seem to indicate that is happening with many
> (>50) threads trying to read from what appears to be the same HashMap.
>
> "pool-1-thread-6451" prio=10 tid=0x00007fa5242c9000 nid=0x10f4
> runnable [0x00007fa52fde4000]
>   java.lang.Thread.State: RUNNABLE
>        at java.util.HashMap.get(HashMap.java:303)
>        at 
> com.google.common.collect.AbstractMultimap.getOrCreateCollection(AbstractMultimap.java:205)
>        at 
> com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:194)
>        at 
> com.google.common.collect.AbstractListMultimap.put(AbstractListMultimap.java:72)
>        at 
> com.google.common.collect.ArrayListMultimap.put(ArrayListMultimap.java:60)
>        at 
> org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:303)
>        at 
> org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:353)
>        at 
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:229)
>        at 
> org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98)
>        at 
> org.apache.cassandra.thrift.CassandraServer.get(CassandraServer.java:289)
>        at 
> org.apache.cassandra.thrift.Cassandra$Processor$get.process(Cassandra.java:2655)
>        at 
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>        at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
>
> When two threads get to this state, they won't ever come back and any
> other thread that tries to read that HashMap will also get wedged
> (looks like that's happened in your case).
>
> On Sun, Jan 23, 2011 at 10:11 AM, Thibaut Britz
> <thibaut.br...@trendiction.com> wrote:
>>
>> Hi,
>>
>> After a while some of my cassandra nodes (out of a pool of 20) get
>> unresponsive. All cores (8) are being used to 100% and uptime load
>> climbs to 80.
>>
>> I first suspected the gargabe collector to force the node to 100% cpu
>> usage , but the problem even occurs with the default garbage
>> collector. A nodetool info also reveals that the heap isn't even
>> completely used.
>>
>> The node has JNA enabled. There are no ERROR message in the logfile
>> (except CASSANDRA-1959 which happened a few days before. But I
>> restarted the node afterwards). Cassandra runs on non virtualized
>> hardware. (e.g. no amazon ec2). The version I use is 0.7.0. I haven't
>> applied any patches (e.g. CASSANDRA-1959).
>>
>> I attached the jstack of the process. (there are many many threads)
>>
>> Thanks,
>> Thibaut
>>
>>
>> ./nodetool -h localhost info
>> ccc
>> Load             : 1.66 GB
>> Generation No    : 1295609201
>> Uptime (seconds) : 196155
>> Heap Memory (MB) : 1795.55 / 3912.06
>>
>>
>> root      2870  730 57.4 6764500 4703712 ?     SLl  Jan21 23913:00
>> /usr/bin/java -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
>> -Xms3997M -Xmx3997M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseLargePages
>> -XX:+UseCompressedOops -Xss128k -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true
>> -Dcom.sun.management.jmxremote.port=8080
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dlog4j.configuration=log4j-server.properties -cp
>> /software/cassandra/bin/../conf:/software/cassandra/bin/../build/classes:/software/cassandra/bin/../lib/antlr-3.1.3.jar:/software/cassandra/bin/../lib/apache-cassandra-0.7.0.jar:/software/cassandra/bin/../lib/avro-1.4.0-fixes.jar:/software/cassandra/bin/../lib/avro-1.4.0-sources-fixes.jar:/software/cassandra/bin/../lib/commons-cli-1.1.jar:/software/cassandra/bin/../lib/commons-codec-1.2.jar:/software/cassandra/bin/../lib/commons-collections-3.2.1.jar:/software/cassandra/bin/../lib/commons-lang-2.4.jar:/software/cassandra/bin/../lib/concurrentlinkedhashmap-lru-1.1.jar:/software/cassandra/bin/../lib/guava-r05.jar:/software/cassandra/bin/../lib/high-scale-lib.jar:/software/cassandra/bin/../lib/ivy-2.1.0.jar:/software/cassandra/bin/../lib/jackson-core-asl-1.4.0.jar:/software/cassandra/bin/../lib/jackson-mapper-asl-1.4.0.jar:/software/cassandra/bin/../lib/jetty-6.1.21.jar:/software/cassandra/bin/../lib/jetty-util-6.1.21.jar:/software/cassandra/bin/../lib/jline-0.9.94.jar:/software/cassandra/bin/../lib/jna.jar:/software/cassandra/bin/../lib/json-simple-1.1.jar:/software/cassandra/bin/../lib/jug-2.0.0.jar:/software/cassandra/bin/../lib/libthrift-0.5.jar:/software/cassandra/bin/../lib/log4j-1.2.16.jar:/software/cassandra/bin/../lib/servlet-api-2.5-20081211.jar:/software/cassandra/bin/../lib/slf4j-api-1.6.1.jar:/software/cassandra/bin/../lib/slf4j-log4j12-1.6.1.jar:/software/cassandra/bin/../lib/snakeyaml-1.6.jar
>> org.apache.cassandra.thrift.CassandraDaemon
>>
>>
>> The error log is full of GCInspector messages:
>>
>>  INFO [ScheduledTasks:1] 2011-01-23 18:53:45,545 GCInspector.java
>> (line 184) table_x.table_x0,0                 0/0            0/200000
>>  INFO [ScheduledTasks:1] 2011-01-23 18:53:45,545 GCInspector.java
>> (line 184) table_x.table_x_meta                 0,0
>> 0/0            0/200000
>>  INFO [ScheduledTasks:1] 2011-01-23 18:53:45,545 GCInspector.java
>> (line 184) table_deduplicate.table_deduplicate_meta
>> 839,29805                 0/0            0/200000
>>  INFO [ScheduledTasks:1] 2011-01-23 18:53:45,546 GCInspector.java
>> (line 184) table_deduplicate.table_deduplicate          838,107630
>>            0/0        19328/200000
>>  WARN [ScheduledTasks:1] 2011-01-23 18:55:05,547 MessagingService.java
>> (line 545) Dropped 1850 READ messages in the last 5000ms
>>  WARN [ScheduledTasks:1] 2011-01-23 18:55:05,547 MessagingService.java
>> (line 545) Dropped 9603 RANGE_SLICE messages in the last 5000ms
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,547 GCInspector.java
>> (line 146) Pool Name                    Active   Pending
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,548 GCInspector.java
>> (line 161) ReadStage                         0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,548 GCInspector.java
>> (line 161) RequestResponseStage              8    310732
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,548 GCInspector.java
>> (line 161) ReadRepair                        0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,548 GCInspector.java
>> (line 161) MutationStage                     0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,549 GCInspector.java
>> (line 161) GossipStage                       0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,549 GCInspector.java
>> (line 161) AntiEntropyStage                  0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,549 GCInspector.java
>> (line 161) MigrationStage                    0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,549 GCInspector.java
>> (line 161) StreamStage                       0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,549 GCInspector.java
>> (line 161) MemtablePostFlusher               0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,550 GCInspector.java
>> (line 161) FlushWriter                       0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,550 GCInspector.java
>> (line 161) MiscStage                         0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,550 GCInspector.java
>> (line 161) FlushSorter                       0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,550 GCInspector.java
>> (line 161) InternalResponseStage             0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,550 GCInspector.java
>> (line 161) HintedHandoff                     0         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,551 GCInspector.java
>> (line 165) CompactionManager               n/a         0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,551 GCInspector.java
>> (line 177) MessagingService                n/a       0,0
>>  INFO [ScheduledTasks:1] 2011-01-23 18:55:05,551 GCInspector.java
>> (line 181) ColumnFamily                Memtable ops,data  Row cache
>> size/cap  Key cache size/cap
>

Reply via email to