Re: Node OOM Problems

Edward Capriolo Fri, 20 Aug 2010 10:51:40 -0700

On Fri, Aug 20, 2010 at 1:17 PM, Wayne <[email protected]> wrote:
> I turned off the creation of the secondary indexes which had the large rows
> and all seemed good. Thank you for the help. I was getting
> 60k+ writes/second on the 6 node cluster.
>
> Unfortunately again three hours later a node went down. I can not even look
> at the logs when it started since they are gone/recycled due to millions of
> message deserialization messages. What are these? The node had 12,098,067
> pending message-deserializer-pool entries in tpstats. The node was up
> according to some nodes and down according to others, which left it flapping
> and still trying to take requests. What does the "dropping message" warning
> from the message deserialization task mean? Why would a node have 12 million
> of these?
>
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past
> timeout)
>
> I do not think this is a large row problem any more. All nodes show a max
> row size around 8-9 megs.
>
> I looked at the munin charts and the disk IO seems to have spiked along with
> compaction. Could compaction kicking in cause this? I have added the 3 JVM
> settings to make compaction a lower priority. Did this help cause this to
> happen by slowing down and building up compaction on a heavily loaded
> system?
>
> Thanks in advance for any help someone can provide.
>
>
> On Fri, Aug 20, 2010 at 8:34 AM, Wayne <[email protected]> wrote:
>>
>> The NullPointerException does not crash the node. It only makes it flap/go
>> down for a short period and then it comes back up. I do not see anything
>> abnormal in the system log, only that single error in the cassandra.log.
>>
>>
>> On Thu, Aug 19, 2010 at 11:42 PM, Peter Schuller
>> <[email protected]> wrote:
>>>
>>> > What is my "live set"?
>>>
>>> Sorry; that meant the "set of data actually live (i.e., not garbage) in
>>> the heap". In other words, the amount of memory truly "used".
>>>
>>> > Is the system CPU bound given the few statements
>>> > below? This is from running 4 concurrent processes against the
>>> > node...do I
>>> > need to throttle back the concurrent read/writers?
>>> >
>>> > I do all reads/writes as Quorum. (Replication factor of 3).
>>>
>>> With quorum and 0.6.4 I don't think unthrottled writes are expected to
>>> cause a problem.
>>>
>>> > The memtable threshold is the default of 256.
>>> >
>>> > All caching is turned off.
>>> >
>>> > The database is pretty small, maybe a few million keys (2-3) in 4 CFs.
>>> > The
>>> > key size is pretty small. Some of the rows are pretty fat though
>>> > (fatter
>>> > than I thought). I am saving secondary indexes in separate CFs and
>>> > those are
>>> > the large rows that I think might be part of the problem. I will
>>> > restart
>>> > testing turning these off and see if I see any difference.
>>> >
>>> > Would an extra fat row explain repeated OOM crashes in a row? I have
>>> > finally
>>> > got the system to stabilize relatively and I even ran compaction on the
>>> > bad
>>> > node without a problem (still no row size stats).
>>>
>>> Based on what you've said so far, the large rows are the only thing I
>>> would suspect may be the cause. With the amount of data and keys you
>>> say you have, you should definitely not be having memory issues with
>>> an 8 gig heap as a direct result of the data size/key count. A few
>>> million keys is not a lot at all; I still claim you should be able to
>>> handle hundreds of millions at least, from the perspective of bloom
>>> filters and such.
>>>
>>> So your plan to try it without these large rows is probably a good
>>> idea unless someone else has a better idea.
>>>
>>> You may want to consider trying 0.7 betas too since it has removed the
>>> limitation with respect to large rows, assuming you do in fact want
>>> these large rows (see the CassandraLimitations wiki page that was
>>> posted earlier in this thread).
>>>
>>> > I now have several other nodes flapping with the following single error
>>> > in
>>> > the cassandra.log
>>> > Error: Exception thrown by the agent : java.lang.NullPointerException
>>> >
>>> > I assume this is an unrelated problem?
>>>
>>> Do you have a full stack trace?
>>>
>>> --
>>> / Peter Schuller
>>
>
>


Just because you are no longer creating the big rows does not mean
they are no longer affecting you. For example, periodic compaction may
run on those keys. Did you delete the keys, and run a major compaction
to clear the data and tombstones?
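
To get a feel for how far behind the node had fallen, the lag can be pulled
out of the "dropping message" warnings quoted above. This is only a rough
sketch: the regex assumes the log format exactly as it appears in this thread,
and the name dropped_lags_ms is made up for illustration. It is not part of
Cassandra.

```python
import re

# Matches the lag in warnings such as:
#   "... dropping message (1,078,378ms past timeout)"
# The format is an assumption based on the lines quoted in this thread.
DROP_RE = re.compile(r"dropping message \(([\d,]+)ms past timeout\)")


def dropped_lags_ms(lines):
    """Return the lag in milliseconds for each 'dropping message' warning."""
    # Warnings may be wrapped across lines and carry '>' quote markers,
    # so strip the markers and join everything before searching.
    text = " ".join(line.strip().lstrip("> ").strip() for line in lines)
    return [int(m.group(1).replace(",", "")) for m in DROP_RE.finditer(text)]


if __name__ == "__main__":
    sample = [
        ">  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602",
        "> MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past",
        "> timeout)",
    ]
    for lag in dropped_lags_ms(sample):
        print(f"{lag} ms is about {lag / 60000:.1f} minutes past timeout")
```

For the warning quoted above, 1,078,378 ms works out to roughly 18 minutes
past the timeout, so the node was dropping messages that were already far too
old to be useful.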
