[jira] [Commented] (CASSANDRA-8259) Add column family name when reporting OutOfMemory errors

Jacek Furmankiewicz (JIRA) Wed, 05 Nov 2014 12:31:04 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199010#comment-14199010
 ]


Jacek Furmankiewicz commented on CASSANDRA-8259:
------------------------------------------------

Well, the problem is that there is TOO much log info.

We have let's say 10 app servers. They all started throwing an exception at the 
same time.

It is very hard to figure out which one of these exceptions was the root cause 
and which ones were just caused by Cassandra going down.

And honestly, the "building Thrift response" code should be smart enough to 
figure out it is about to bring the whole server down.

A simple check on the number of rows and columns returned and their size and 
maybe throwing a regular exception would have been a much better option than 
crashing the entire server.
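
A minimal sketch of such a check, in Java. Nothing here is Cassandra's actual 
code: ResponseSizeGuard, ResponseTooLargeException and the byte limit are 
hypothetical names chosen for illustration, and the result is modeled as a 
plain map of row key to column values rather than the Thrift types seen in the 
stack trace.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

// Hypothetical guard, not part of Cassandra: every name and limit here is an assumption.
final class ResponseSizeGuard {

    /** Ordinary, recoverable exception thrown instead of letting the heap fill up. */
    static final class ResponseTooLargeException extends RuntimeException {
        ResponseTooLargeException(String message) {
            super(message);
        }
    }

    private final long maxBytes;

    ResponseSizeGuard(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Adds up the size of every column value in every row and throws as soon as
     * the running total exceeds the limit. The column family name is included in
     * the message so the offending query can be traced.
     *
     * @return the total number of value bytes when the result is within the limit
     */
    long check(String columnFamily, Map<ByteBuffer, List<ByteBuffer>> rows) {
        long totalBytes = 0;
        long columns = 0;
        for (List<ByteBuffer> row : rows.values()) {
            for (ByteBuffer value : row) {
                totalBytes += value.remaining();
                columns++;
                if (totalBytes > maxBytes) {
                    throw new ResponseTooLargeException(
                        "Response for column family '" + columnFamily + "' exceeds "
                        + maxBytes + " bytes after " + columns + " columns");
                }
            }
        }
        return totalBytes;
    }
}
```

The point of the design is that the caller gets a normal exception for one 
request, with the column family in the message, while the rest of the server 
keeps running.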

I am not sure the Cassandra team understands how a new technology like this 
looks in the eyes of a conservative customer if a single query can crash it.

And the worst part is that this query works fine for other customers: it's 
very data size dependent, and data size varies greatly between customers.
So it's not obvious that a particular query is a threat to the stability of the 
underlying DB.

We've had cases where multiple queries from servers at the same time brought 
down the whole cluster (not just a single node).

Telling the customer that it is much more stable because it is a distributed DB 
is a difficult argument to make after an event like that...

> Add column family name when reporting OutOfMemory errors
> --------------------------------------------------------
>
>                 Key: CASSANDRA-8259
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8259
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jacek Furmankiewicz
>
> When we get a Thrift error like this which causes a server crash:
> {noformat}
> ERROR [Thrift:33] 2014-11-05 17:36:07,486 CassandraDaemon.java (line 196) Exception in thread Thread[Thrift:33,5,main]
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.thrift.transport.TFramedTransport.write(TFramedTransport.java:146)
>         at org.apache.thrift.protocol.TBinaryProtocol.writeBinary(TBinaryProtocol.java:183)
>         at org.apache.cassandra.thrift.Column$ColumnStandardScheme.write(Column.java:678)
>         at org.apache.cassandra.thrift.Column$ColumnStandardScheme.write(Column.java:611)
>         at org.apache.cassandra.thrift.Column.write(Column.java:538)
>         at org.apache.cassandra.thrift.ColumnOrSuperColumn$ColumnOrSuperColumnStandardScheme.write(ColumnOrSuperColumn.java:673)
>         at org.apache.cassandra.thrift.ColumnOrSuperColumn$ColumnOrSuperColumnStandardScheme.write(ColumnOrSuperColumn.java:607)
>         at org.apache.cassandra.thrift.ColumnOrSuperColumn.write(ColumnOrSuperColumn.java:517)
>         at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.write(Cassandra.java:14559)
>         at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.write(Cassandra.java:14463)
>         at org.apache.cassandra.thrift.Cassandra$multiget_slice_result.write(Cassandra.java:14393)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:53)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>  INFO [StorageServiceShutdownHook] 2014-11-05 17:36:07,488 ThriftServer.java (line 141) Stop listening to thrift clients
> {noformat}
> we have no clue as to which column family was being queried. That makes it 
> extremely difficult to troubleshoot which query in a complex code base caused 
> this error.
> We have multiple servers and they all throw a NoAvailableHostException in 
> Astyanax at the same time, all in different parts of the code...so figuring 
> out the root cause is an exercise in frustration that takes many hours.
> At least listing the column family in this message would save us COUNTLESS 
> hours of troubleshooting.
> We're on 2.0.8, JDK 1.7, RHEL 6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
