I deleted ALL data and reset the nodes from scratch, so there are no more large rows in there; the max row size is 8-9 MB across all nodes. This appears to be a new problem. I restarted the node in question and it seems to be running fine, but I had to run repair on it, as it appears to be missing a lot of data.
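(For reference, a minimal sketch of the checks and the repair run described above, assuming a 0.6-era nodetool, the default JMX port 8080, and a hypothetical host name:)

    # Look at thread pool backlogs; a huge MESSAGE-DESERIALIZER-POOL pending
    # count is the symptom described later in this thread
    bin/nodetool --host 10.0.0.1 --port 8080 tpstats

    # Check how this node sees the ring (Up/Down state of its peers)
    bin/nodetool --host 10.0.0.1 --port 8080 ring

    # Anti-entropy repair: pull back any data this node is missing
    bin/nodetool --host 10.0.0.1 --port 8080 repair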
On Fri, Aug 20, 2010 at 7:51 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> On Fri, Aug 20, 2010 at 1:17 PM, Wayne <wav...@gmail.com> wrote:
> > I turned off the creation of the secondary indexes which had the large
> > rows, and all seemed good. Thank you for the help. I was getting
> > 60k+ writes/second on the 6 node cluster.
> >
> > Unfortunately, three hours later a node went down again. I cannot even
> > look at the logs from when it started, since they have been recycled by
> > millions of message deserialization messages. What are these? The node
> > had 12,098,067 pending MESSAGE-DESERIALIZER-POOL entries in tpstats. The
> > node was up according to some nodes and down according to others, which
> > made it flap while still trying to take requests. What does the "dropping
> > message" warning from the message deserialization task mean? Why would a
> > node have 12 million of these?
> >
> > WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> > MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
> > WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> > MessageDeserializationTask.java (line 47) dropping message (1,078,378ms past timeout)
> >
> > I do not think this is a large-row problem any more. All nodes show a max
> > row size around 8-9 MB.
> >
> > I looked at the munin charts, and the disk I/O seems to have spiked along
> > with compaction. Could compaction kicking in cause this? I have added the
> > 3 JVM settings to make compaction a lower priority (see the sketch below
> > the thread). Did this help cause the problem by slowing compaction down
> > and letting it build up on a heavily loaded system?
> >
> > Thanks in advance for any help someone can provide.
> >
> > On Fri, Aug 20, 2010 at 8:34 AM, Wayne <wav...@gmail.com> wrote:
> >> The NullPointerException does not crash the node. It only makes it
> >> flap/go down for a short period, and then it comes back up. I do not see
> >> anything abnormal in the system log, only that single error in the
> >> cassandra.log.
> >>
> >> On Thu, Aug 19, 2010 at 11:42 PM, Peter Schuller
> >> <peter.schul...@infidyne.com> wrote:
> >>> > What is my "live set"?
> >>>
> >>> Sorry; that meant the "set of data actually live (i.e., not garbage) in
> >>> the heap". In other words, the amount of memory truly "used".
> >>>
> >>> > Is the system CPU bound given the few statements below? This is from
> >>> > running 4 concurrent processes against the node... do I need to
> >>> > throttle back the concurrent readers/writers?
> >>> >
> >>> > I do all reads/writes at Quorum. (Replication factor of 3.)
> >>>
> >>> With quorum and 0.6.4 I don't think unthrottled writes are expected to
> >>> cause a problem.
> >>>
> >>> > The memtable threshold is the default of 256.
> >>> >
> >>> > All caching is turned off.
> >>> >
> >>> > The database is pretty small, maybe a few million keys (2-3) in 4 CFs.
> >>> > The key size is pretty small. Some of the rows are fatter than I
> >>> > thought, though. I am saving secondary indexes in separate CFs, and
> >>> > those are the large rows that I think might be part of the problem. I
> >>> > will restart testing with these turned off and see if I notice any
> >>> > difference.
> >>> >
> >>> > Would an extra-fat row explain repeated OOM crashes? I have finally
> >>> > gotten the system relatively stable, and I even ran compaction on the
> >>> > bad node without a problem (still no row size stats).
> >>>
> >>> Based on what you've said so far, the large rows are the only thing I
> >>> would suspect may be the cause. With the amount of data and keys you
> >>> say you have, you should definitely not be having memory issues with
> >>> an 8 GB heap as a direct result of the data size/key count. A few
> >>> million keys is not a lot at all; I still claim you should be able to
> >>> handle hundreds of millions at least, from the perspective of bloom
> >>> filters and such.
> >>>
> >>> So your plan to try it without these large rows is probably a good
> >>> idea, unless someone else has a better one.
> >>>
> >>> You may want to consider trying the 0.7 betas too, since 0.7 has
> >>> removed the limitation with respect to large rows, assuming you do in
> >>> fact want them (see the CassandraLimitations wiki page that was posted
> >>> earlier in this thread).
> >>>
> >>> > I now have several other nodes flapping with the following single
> >>> > error in the cassandra.log:
> >>> > Error: Exception thrown by the agent : java.lang.NullPointerException
> >>> >
> >>> > I assume this is an unrelated problem?
> >>>
> >>> Do you have a full stack trace?
> >>>
> >>> --
> >>> / Peter Schuller
> >>
> >
>
> Just because you are no longer creating the big rows does not mean they
> are no longer affecting you. For example, periodic compaction may still
> run over those keys. Did you delete the keys and run a major compaction
> to clear the data and tombstones?
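(The "3 JVM settings" mentioned in the thread above for lowering compaction priority are presumably the trio that circulated on this list for 0.6.x; as a hedged sketch only, to be verified against your version, they would be appended to JVM_OPTS in cassandra.in.sh:)

    # Let Java thread priorities take effect on Linux, then run the
    # compaction thread at the lowest priority. ThreadPriorityPolicy=42
    # is the unofficial HotSpot value that permits lowered priorities
    # without running as root; cassandra.compaction.priority appeared
    # in the 0.6.x line.
    JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities"
    JVM_OPTS="$JVM_OPTS -XX:ThreadPriorityPolicy=42"
    JVM_OPTS="$JVM_OPTS -Dcassandra.compaction.priority=1"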
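(On Edward's closing question, a minimal sketch of the cleanup he is describing, again assuming 0.6-era nodetool and the hypothetical host/port used earlier:)

    # After deleting the offending keys, force a major compaction so the
    # deleted data and, eventually, its tombstones are merged out of the
    # data files
    bin/nodetool --host 10.0.0.1 --port 8080 compact

    # Note: a row's tombstone survives compaction until it is older than
    # GCGraceSeconds (864000 seconds = 10 days by default in
    # storage-conf.xml), so the space may not be reclaimed immediately.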