Re: OOM only on one datacenter nodes

2020-04-06 Thread Reid Pinchback
CentOS 6.10 is a bit aged as a production server O/S platform, and I recall 
some odd-ball interactions with hardware variations, particularly around 
high-priority memory and network cards.  How good is your O/S-level metric 
monitoring?  It isn't beyond the realm of possibility that your memory issues are 
outside of the JVM.  It isn't easy to say specifically what to look for, 
but I would begin with metrics around memory and swap.  If you don't see 
consistently high memory use outside of the JVM, that saves you wasting time 
chasing down details that are unlikely to matter.  You need to be used to seeing 
what those metrics normally look like, though, so you aren't chasing phantoms.
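
A minimal sketch of this kind of O/S-level check (assuming a standard Linux box 
where the Cassandra JVM is the main java process; adjust process names and paths 
to your environment):

    free -m                         # overall RAM and swap usage on the box
    vmstat 5 5                      # si/so columns show ongoing swap-in/swap-out activity
    ps -o pid,rss,vsz,cmd -C java   # resident set size of the Cassandra JVM itself

If the JVM's RSS plus the usual page cache accounts for what free reports, the 
pressure is probably inside the JVM; if not, something else on the box is 
consuming memory.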

I second Jeff’s feedback.  You need the information you need.  It seems 
counterproductive to not configure these nodes to do what you need.  A 
fundamental value of C* is the ability to bring nodes up and down without 
risking availability.  When your existing technology approach is part of why 
you can’t gather the data you need, it helps to give yourself permission to 
improve what you have so you don’t remain in that situation.


From: Surbhi Gupta 
Date: Monday, April 6, 2020 at 12:44 AM
To: "user@cassandra.apache.org" 
Cc: Reid Pinchback 
Subject: Re: OOM only on one datacenter nodes

We are using the JRE and not the JDK, hence we are not able to take a heap dump.

On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa <jji...@gmail.com> wrote:

Set the jvm flags to heap dump on oom

Open up the result in a heap inspector of your preference (like yourkit or 
similar)

Find a view that counts objects by total retained size. Take a screenshot. Send 
that.
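
For reference, the flags being referred to are presumably the standard HotSpot 
ones; a minimal sketch, assuming Cassandra 3.x where they go in conf/jvm.options 
(older versions append them to JVM_OPTS in conf/cassandra-env.sh):

    # dump the heap automatically when the JVM hits OutOfMemoryError
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/var/lib/cassandra/heapdump   # example path; needs room for a roughly heap-sized file

These are plain HotSpot VM options, so a JRE can produce the dump too; no 
JDK-only tool such as jmap is involved.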




On Apr 5, 2020, at 6:51 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
I just checked; we have set up the heap size to be 31GB, not 32GB, in DC2.

I checked, and the CPU and RAM are both the same on all the nodes in DC1 and DC2.
What specific parameters should I check at the OS level?
We are using CentOS release 6.10.

Currently disk_access_mode is not set, hence it is auto in our env. Would
setting disk_access_mode to mmap_index_only help?

Thanks
Surbhi

On Sun, 5 Apr 2020 at 01:31, Alex Ott <alex...@gmail.com> wrote:
Have you set -Xmx32g? In that case you may get significantly less
available memory because of the switch to 64-bit references.  See
http://java-performance.info/over-32g-heap-java/ for details, and set
slightly less than 32Gb

Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
 RP> Surbi:

 RP> If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That
 RP> still wouldn’t explain DC2 nodes going down, but would at least explain 
them doing more work than might be on your radar right now.

 RP> The hint replay being slow to me sounds like you could be fighting GC.

 RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already been 
doing this, but if not, be sure to be under 32gb, like 31gb.
 RP> Otherwise you’re using larger object pointers and could actually have less 
effective ability to allocate memory.

 RP> As the problem is only happening in DC2, then there has to be a thing that 
is true in DC2 that isn’t true in DC1.  A difference in hardware, a
 RP> difference in O/S version, a difference in networking config or physical 
infrastructure, a difference in client-triggered activity, or a
 RP> difference in how repairs are handled. Somewhere, there is a difference.  
I’d start with focusing on that.

 RP> From: Erick Ramirez <erick.rami...@datastax.com>
 RP> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
 RP> Date: Saturday, April 4, 2020 at 8:28 PM
 RP> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
 RP> Subject: Re: OOM only on one datacenter nodes


 RP> With a lack of heapdump for you to analyse, my hypothesis is that your DC2 
nodes are taking on traffic (from some client somewhere) but you're
 RP> just not aware of it. The hints replay is just a side-effect of the nodes 
getting overloaded.

 RP> To rule out my hypothesis in the first instance, my recommendation is to 
monitor the incoming connections to the nodes in DC2. If you don't
 RP> have monitoring in place, you could simply run netstat at regular 
intervals and go from there. Cheers!


Re: OOM only on one datacenter nodes

2020-04-06 Thread Jeff Jirsa
Nobody is going to do any better than guessing without a heap histogram

I’ve got pretty good intuition with Cassandra in real prod environments and can 
think of like 8-9 different possible causes, but none of them really stand out 
as likely enough to describe in detail (maybe the memtable deadlock on 
flush, or maybe repair coordination in that DC), but I really need a heap.

If it’s causing you enough pain to email the list, it seems worth making 
the changes you need to make to get a heap and debug properly.


> On Apr 5, 2020, at 9:44 PM, Surbhi Gupta  wrote:
> 
> 
> We are using the JRE and not the JDK, hence we are not able to take a heap dump.
> 
>> On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa  wrote:
>> 
>> Set the jvm flags to heap dump on oom
>> 
>> Open up the result in a heap inspector of your preference (like yourkit or 
>> similar)
>> 
>> Find a view that counts objects by total retained size. Take a screenshot. 
>> Send that. 
>> 
>> 
>> 
>>>> On Apr 5, 2020, at 6:51 PM, Surbhi Gupta  wrote:
>>>> 
>>> 
>>> I just checked; we have set up the heap size to be 31GB, not 32GB, in DC2.
>>> 
>>> I checked, and the CPU and RAM are both the same on all the nodes in DC1 and DC2.
>>> What specific parameters should I check at the OS level?
>>> We are using CentOS release 6.10.
>>> 
>>> Currently disk_access_mode is not set, hence it is auto in our env. Would
>>> setting disk_access_mode to mmap_index_only help?
>>> 
>>> Thanks
>>> Surbhi
>>> 
>>>> On Sun, 5 Apr 2020 at 01:31, Alex Ott  wrote:
>>>> Have you set -Xmx32g ? In this case you may get significantly less
>>>> available memory because of switch to 64-bit references.  See
>>>> http://java-performance.info/over-32g-heap-java/ for details, and set
>>>> slightly less than 32Gb
>>>> 
>>>> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
>>>>  RP> Surbi:
>>>> 
>>>>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if 
>>>> the operations hitting DC1 are quorum ops instead of local quorum.  That
>>>>  RP> still wouldn’t explain DC2 nodes going down, but would at least 
>>>> explain them doing more work than might be on your radar right now.
>>>> 
>>>>  RP> The hint replay being slow to me sounds like you could be fighting GC.
>>>> 
>>>>  RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already 
>>>> been doing this, but if not, be sure to be under 32gb, like 31gb. 
>>>>  RP> Otherwise you’re using larger object pointers and could actually have 
>>>> less effective ability to allocate memory.
>>>> 
>>>>  RP> As the problem is only happening in DC2, then there has to be a thing 
>>>> that is true in DC2 that isn’t true in DC1.  A difference in hardware, a
>>>>  RP> difference in O/S version, a difference in networking config or 
>>>> physical infrastructure, a difference in client-triggered activity, or a
>>>>  RP> difference in how repairs are handled. Somewhere, there is a 
>>>> difference.  I’d start with focusing on that.
>>>> 
>>>>  RP> From: Erick Ramirez 
>>>>  RP> Reply-To: "user@cassandra.apache.org" 
>>>>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>>>>  RP> To: "user@cassandra.apache.org" 
>>>>  RP> Subject: Re: OOM only on one datacenter nodes
>>>> 
>>>> 
>>>>  RP> With a lack of heapdump for you to analyse, my hypothesis is that 
>>>> your DC2 nodes are taking on traffic (from some client somewhere) but 
>>>> you're
>>>>  RP> just not aware of it. The hints replay is just a side-effect of the 
>>>> nodes getting overloaded.
>>>> 
>>>>  RP> To rule out my hypothesis in the first instance, my recommendation is 
>>>> to monitor the incoming connections to the nodes in DC2. If you don't
>>>>  RP> have monitoring in place, you could simply run netstat at regular 
>>>> intervals and go from there. Cheers!
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> With best wishes,
>>>> Alex Ott
>>>> Principal Architect, DataStax
>>>> http://datastax.com/
>>>> 
>>>> 


Re: OOM only on one datacenter nodes

2020-04-05 Thread Surbhi Gupta
We are using the JRE and not the JDK, hence we are not able to take a heap dump.

On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa  wrote:

>
> Set the jvm flags to heap dump on oom
>
> Open up the result in a heap inspector of your preference (like yourkit or
> similar)
>
> Find a view that counts objects by total retained size. Take a screenshot.
> Send that.
>
>
>
> On Apr 5, 2020, at 6:51 PM, Surbhi Gupta  wrote:
>
> 
> I just checked; we have set up the heap size to be 31GB, not 32GB, in DC2.
>
> I checked, and the CPU and RAM are both the same on all the nodes in DC1 and DC2.
> What specific parameters should I check at the OS level?
> We are using CentOS release 6.10.
>
> Currently disk_access_mode is not set, hence it is auto in our env. Would
> setting disk_access_mode to mmap_index_only help?
>
> Thanks
> Surbhi
>
> On Sun, 5 Apr 2020 at 01:31, Alex Ott  wrote:
>
>> Have you set -Xmx32g ? In this case you may get significantly less
>> available memory because of switch to 64-bit references.  See
>> http://java-performance.info/over-32g-heap-java/ for details, and set
>> slightly less than 32Gb
>>
>> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
>>  RP> Surbi:
>>
>>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if
>> the operations hitting DC1 are quorum ops instead of local quorum.  That
>>  RP> still wouldn’t explain DC2 nodes going down, but would at least
>> explain them doing more work than might be on your radar right now.
>>
>>  RP> The hint replay being slow to me sounds like you could be fighting
>> GC.
>>
>>  RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already
>> been doing this, but if not, be sure to be under 32gb, like 31gb.
>>  RP> Otherwise you’re using larger object pointers and could actually
>> have less effective ability to allocate memory.
>>
>>  RP> As the problem is only happening in DC2, then there has to be a
>> thing that is true in DC2 that isn’t true in DC1.  A difference in
>> hardware, a
>>  RP> difference in O/S version, a difference in networking config or
>> physical infrastructure, a difference in client-triggered activity, or a
>>  RP> difference in how repairs are handled. Somewhere, there is a
>> difference.  I’d start with focusing on that.
>>
>>  RP> From: Erick Ramirez 
>>  RP> Reply-To: "user@cassandra.apache.org" 
>>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>>  RP> To: "user@cassandra.apache.org" 
>>  RP> Subject: Re: OOM only on one datacenter nodes
>>
>>
>>  RP> With a lack of heapdump for you to analyse, my hypothesis is that
>> your DC2 nodes are taking on traffic (from some client somewhere) but you're
>>  RP> just not aware of it. The hints replay is just a side-effect of the
>> nodes getting overloaded.
>>
>>  RP> To rule out my hypothesis in the first instance, my recommendation
>> is to monitor the incoming connections to the nodes in DC2. If you don't
>>  RP> have monitoring in place, you could simply run netstat at regular
>> intervals and go from there. Cheers!
>>
>>
>>
>>
>> --
>> With best wishes,
>> Alex Ott
>> Principal Architect, DataStax
>> http://datastax.com/
>>
>>
>>


Re: OOM only on one datacenter nodes

2020-04-05 Thread Jeff Jirsa

Set the jvm flags to heap dump on oom

Open up the result in a heap inspector of your preference (like yourkit or 
similar)

Find a view that counts objects by total retained size. Take a screenshot. Send 
that. 



> On Apr 5, 2020, at 6:51 PM, Surbhi Gupta  wrote:
> 
> 
> I just checked; we have set up the heap size to be 31GB, not 32GB, in DC2.
>
> I checked, and the CPU and RAM are both the same on all the nodes in DC1 and DC2.
> What specific parameters should I check at the OS level?
> We are using CentOS release 6.10.
>
> Currently disk_access_mode is not set, hence it is auto in our env. Would
> setting disk_access_mode to mmap_index_only help?
> 
> Thanks
> Surbhi
> 
>> On Sun, 5 Apr 2020 at 01:31, Alex Ott  wrote:
>> Have you set -Xmx32g ? In this case you may get significantly less
>> available memory because of switch to 64-bit references.  See
>> http://java-performance.info/over-32g-heap-java/ for details, and set
>> slightly less than 32Gb
>> 
>> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
>>  RP> Surbi:
>> 
>>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if 
>> the operations hitting DC1 are quorum ops instead of local quorum.  That
>>  RP> still wouldn’t explain DC2 nodes going down, but would at least explain 
>> them doing more work than might be on your radar right now.
>> 
>>  RP> The hint replay being slow to me sounds like you could be fighting GC.
>> 
>>  RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already 
>> been doing this, but if not, be sure to be under 32gb, like 31gb. 
>>  RP> Otherwise you’re using larger object pointers and could actually have 
>> less effective ability to allocate memory.
>> 
>>  RP> As the problem is only happening in DC2, then there has to be a thing 
>> that is true in DC2 that isn’t true in DC1.  A difference in hardware, a
>>  RP> difference in O/S version, a difference in networking config or 
>> physical infrastructure, a difference in client-triggered activity, or a
>>  RP> difference in how repairs are handled. Somewhere, there is a 
>> difference.  I’d start with focusing on that.
>> 
>>  RP> From: Erick Ramirez 
>>  RP> Reply-To: "user@cassandra.apache.org" 
>>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>>  RP> To: "user@cassandra.apache.org" 
>>  RP> Subject: Re: OOM only on one datacenter nodes
>> 
>> 
>>  RP> With a lack of heapdump for you to analyse, my hypothesis is that your 
>> DC2 nodes are taking on traffic (from some client somewhere) but you're
>>  RP> just not aware of it. The hints replay is just a side-effect of the 
>> nodes getting overloaded.
>> 
>>  RP> To rule out my hypothesis in the first instance, my recommendation is 
>> to monitor the incoming connections to the nodes in DC2. If you don't
>>  RP> have monitoring in place, you could simply run netstat at regular 
>> intervals and go from there. Cheers!
>> 
>> 
>> 
>> 
>> -- 
>> With best wishes,
>> Alex Ott
>> Principal Architect, DataStax
>> http://datastax.com/
>> 
>> 


Re: OOM only on one datacenter nodes

2020-04-05 Thread Surbhi Gupta
I just checked; we have set up the heap size to be 31GB, not 32GB, in DC2.

I checked, and the CPU and RAM are both the same on all the nodes in DC1 and DC2.
What specific parameters should I check at the OS level?
We are using CentOS release 6.10.

Currently disk_access_mode is not set, hence it is auto in our env. Would
setting disk_access_mode to mmap_index_only help?

Thanks
Surbhi
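
For context on the setting being asked about: disk_access_mode is not listed in 
the default cassandra.yaml, but it is a recognized option (valid values are 
auto, mmap, mmap_index_only and standard); a sketch of setting it explicitly:

    # cassandra.yaml -- not present in the shipped file, add it by hand
    # auto (the default) mmaps both data and index files on 64-bit JVMs;
    # mmap_index_only keeps index files mmapped and reads data files with standard I/O
    disk_access_mode: mmap_index_only

This only changes how SSTables are read, so whether it helps depends on whether 
the memory pressure actually comes from mmapped data files.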

On Sun, 5 Apr 2020 at 01:31, Alex Ott  wrote:

> Have you set -Xmx32g ? In this case you may get significantly less
> available memory because of switch to 64-bit references.  See
> http://java-performance.info/over-32g-heap-java/ for details, and set
> slightly less than 32Gb
>
> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
>  RP> Surbi:
>
>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if
> the operations hitting DC1 are quorum ops instead of local quorum.  That
>  RP> still wouldn’t explain DC2 nodes going down, but would at least
> explain them doing more work than might be on your radar right now.
>
>  RP> The hint replay being slow to me sounds like you could be fighting GC.
>
>  RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already
> been doing this, but if not, be sure to be under 32gb, like 31gb.
>  RP> Otherwise you’re using larger object pointers and could actually have
> less effective ability to allocate memory.
>
>  RP> As the problem is only happening in DC2, then there has to be a thing
> that is true in DC2 that isn’t true in DC1.  A difference in hardware, a
>  RP> difference in O/S version, a difference in networking config or
> physical infrastructure, a difference in client-triggered activity, or a
>  RP> difference in how repairs are handled. Somewhere, there is a
> difference.  I’d start with focusing on that.
>
>  RP> From: Erick Ramirez 
>  RP> Reply-To: "user@cassandra.apache.org" 
>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>  RP> To: "user@cassandra.apache.org" 
>  RP> Subject: Re: OOM only on one datacenter nodes
>
>
>  RP> With a lack of heapdump for you to analyse, my hypothesis is that
> your DC2 nodes are taking on traffic (from some client somewhere) but you're
>  RP> just not aware of it. The hints replay is just a side-effect of the
> nodes getting overloaded.
>
>  RP> To rule out my hypothesis in the first instance, my recommendation is
> to monitor the incoming connections to the nodes in DC2. If you don't
>  RP> have monitoring in place, you could simply run netstat at regular
> intervals and go from there. Cheers!
>
>
>
>
> --
> With best wishes,
> Alex Ott
> Principal Architect, DataStax
> http://datastax.com/
>
>
>


Re: OOM only on one datacenter nodes

2020-04-05 Thread Alex Ott
Have you set -Xmx32g? In that case you may get significantly less
available memory because of the switch to 64-bit references.  See
http://java-performance.info/over-32g-heap-java/ for details, and set
slightly less than 32Gb
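
A sketch of what that looks like in practice (assuming the usual MAX_HEAP_SIZE 
knob in cassandra-env.sh; on 3.x the same thing can be done with -Xms/-Xmx in 
jvm.options):

    # cassandra-env.sh -- stay below 32GB so the JVM keeps compressed ordinary object pointers
    MAX_HEAP_SIZE="31G"

    # quick check that compressed oops are still on at a given heap size:
    java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
    # "bool UseCompressedOops := true" is what you want; at -Xmx32g it flips to false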

Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
 RP> Surbi:

 RP> If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That
 RP> still wouldn’t explain DC2 nodes going down, but would at least explain 
them doing more work than might be on your radar right now.

 RP> The hint replay being slow to me sounds like you could be fighting GC.

 RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already been 
doing this, but if not, be sure to be under 32gb, like 31gb. 
 RP> Otherwise you’re using larger object pointers and could actually have less 
effective ability to allocate memory.

 RP> As the problem is only happening in DC2, then there has to be a thing that 
is true in DC2 that isn’t true in DC1.  A difference in hardware, a
 RP> difference in O/S version, a difference in networking config or physical 
infrastructure, a difference in client-triggered activity, or a
 RP> difference in how repairs are handled. Somewhere, there is a difference.  
I’d start with focusing on that.

 RP> From: Erick Ramirez 
 RP> Reply-To: "user@cassandra.apache.org" 
 RP> Date: Saturday, April 4, 2020 at 8:28 PM
 RP> To: "user@cassandra.apache.org" 
 RP> Subject: Re: OOM only on one datacenter nodes


 RP> With a lack of heapdump for you to analyse, my hypothesis is that your DC2 
nodes are taking on traffic (from some client somewhere) but you're
 RP> just not aware of it. The hints replay is just a side-effect of the nodes 
getting overloaded.

 RP> To rule out my hypothesis in the first instance, my recommendation is to 
monitor the incoming connections to the nodes in DC2. If you don't
 RP> have monitoring in place, you could simply run netstat at regular 
intervals and go from there. Cheers!




-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: OOM only on one datacenter nodes

2020-04-04 Thread Reid Pinchback
Surbi:

If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That still 
wouldn’t explain DC2 nodes going down, but would at least explain them doing 
more work than might be on your radar right now.

The hint replay being slow to me sounds like you could be fighting GC.

You mentioned bumping the DC2 nodes to 32gb.  You might have already been doing 
this, but if not, be sure to be under 32gb, like 31gb.  Otherwise you’re using 
larger object pointers and could actually have less effective ability to 
allocate memory.

As the problem is only happening in DC2, then there has to be a thing that is 
true in DC2 that isn’t true in DC1.  A difference in hardware, a difference in 
O/S version, a difference in networking config or physical infrastructure, a 
difference in client-triggered activity, or a difference in how repairs are 
handled. Somewhere, there is a difference.  I’d start with focusing on that.


From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Saturday, April 4, 2020 at 8:28 PM
To: "user@cassandra.apache.org" 
Subject: Re: OOM only on one datacenter nodes

With a lack of heapdump for you to analyse, my hypothesis is that your DC2 
nodes are taking on traffic (from some client somewhere) but you're just not 
aware of it. The hints replay is just a side-effect of the nodes getting 
overloaded.

To rule out my hypothesis in the first instance, my recommendation is to 
monitor the incoming connections to the nodes in DC2. If you don't have 
monitoring in place, you could simply run netstat at regular intervals and go 
from there. Cheers!




Re: OOM only on one datacenter nodes

2020-04-04 Thread Erick Ramirez
With a lack of heapdump for you to analyse, my hypothesis is that your DC2
nodes are taking on traffic (from some client somewhere) but you're just
not aware of it. The hints replay is just a side-effect of the nodes
getting overloaded.

To rule out my hypothesis in the first instance, my recommendation is to
monitor the incoming connections to the nodes in DC2. If you don't have
monitoring in place, you could simply run netstat at regular intervals and
go from there. Cheers!



Re: OOM after a while during compacting

2018-04-05 Thread Nate McCall
>
>
> - Heap size is set to 8GB
> - Using G1GC
> - I tried moving the memtable out of the heap. It helped but I still got
> an OOM last night
> - Concurrent compactors is set to 1 but it still happens and also tried
> setting throughput between 16 and 128, no changes.
>

That heap size is way too small for G1GC. Switch back to the defaults with
CMS. IME, G1 needs > 20g for *just* the JVM to see improvements (but this
also depends on workload and a few other factors). Stick with the CMS
defaults unless you have some evidence-based experiment to try.

Also worth noting that with a 1TB gp2 EBS volume, you only have 3k IOPS to
play with before you are subject to rate limiting. If you allocate a volume
greater than 3.33TB, you get 10K IOPS and the rate limiting goes away (you
can see this playing around with the EBS sizing in the AWS calculator:
http://calculator.s3.amazonaws.com/index.html). Another common mistake here
is accidentally putting the commitlog on the boot volume which has a super
low amount of IOPS given it's 64g (?iirc) by default.
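
For reference, the arithmetic behind those numbers, as gp2 was provisioned at 
the time of this thread (a baseline of 3 IOPS per GB, capped at 10,000):

    1,000 GB x 3 IOPS/GB  ~=  3,000 IOPS      (the 1TB volume above)
    10,000 IOPS / 3 IOPS per GB  ~=  3,333 GB (roughly 3.33TB, where the volume sits at
                                               the 10K ceiling and the per-GB baseline stops mattering)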


Re: OOM after a while during compacting

2018-04-05 Thread Zsolt Pálmai
Yeah, they are pretty much unique but it's only a few requests per day so
hitting all the nodes would be fine for now.

2018-04-05 15:43 GMT+02:00 Evelyn Smith :

> Not sure if it differs for SASI Secondary Indexes but my understanding is
> it’s a bad idea to use high cardinality columns for Secondary Indexes.
> Not sure what your data model looks like but I’d assume UUID would have
> very high cardinality.
>
> If that’s the case it pretty much guarantees any query on the secondary
> index will hit all the nodes, which is what you want to avoid.
>
> Also Secondary Indexes are generally bad for Cassandra, if you don’t need
> them or there's a way around using them I’d go with that.
>
> Regards,
> Eevee.
>
>
> On 5 Apr 2018, at 11:27 pm, Zsolt Pálmai  wrote:
>
> Tried both (although with the biggest table) and the result is the same.
>
> I stumbled upon this jira issue:
> https://issues.apache.org/jira/browse/CASSANDRA-12662
> Since the SASI indexes I use are only helping in debugging (for now) I
> dropped them and it seems the tables get compacted now (at least it made it
> further than before and the JVM metrics look healthy).
>
> Still this is not ideal as it would be nice to have those secondary
> indexes :/ .
>
> The columns I indexed are basically uuids (so I can match the rows from
> other systems but this is usually triggered manually so performance loss is
> acceptable).
> Is there a recommended index to use here? Or setting
> the max_compaction_flush_memory_in_mb value? I saw that it can cause
> different kinds of problems... Or the default secondary index?
>
> Thanks
>
>
>
> 2018-04-05 15:14 GMT+02:00 Evelyn Smith :
>
>> Probably a dumb question but it’s good to clarify.
>>
>> Are you compacting the whole keyspace or are you compacting tables one at
>> a time?
>>
>>
>> On 5 Apr 2018, at 9:47 pm, Zsolt Pálmai  wrote:
>>
>> Hi!
>>
>> I have a setup with 4 AWS nodes (m4xlarge - 4 cpu, 16gb ram, 1TB ssd
>> each) and when running the nodetool compact command on any of the servers I
>> get out of memory exception after a while.
>>
>> - Before calling the compact first I did a repair and before that there
>> was a bigger update on a lot of entries so I guess a lot of sstables were
>> created. The repair created around ~250 pending compaction tasks, 2 of the
>> nodes I managed to finish with upgrading to a 2xlarge machine and twice the
>> heap (but running the compact on them manually also killed one :/ so this
>> isn't an ideal solution)
>>
>> Some more info:
>> - Version is the newest 3.11.2 with java8u116
>> - Using LeveledCompactionStrategy (we have mostly reads)
>> - Heap size is set to 8GB
>> - Using G1GC
>> - I tried moving the memtable out of the heap. It helped but I still got
>> an OOM last night
>> - Concurrent compactors is set to 1 but it still happens and also tried
>> setting throughput between 16 and 128, no changes.
>> - Storage load is 127Gb/140Gb/151Gb/155Gb
>> - 1 keyspace, 16 tables but there are a few SASI indexes on big tables.
>> - The biggest partition I found was 90Mb but that table has only 2
>> sstables attached and compacts in seconds. The rest is mostly 1 line
>> partition with a few 10KB of data.
>> - Worst SSTable case: SSTables in each level: [1, 20/10, 106/100, 15, 0,
>> 0, 0, 0, 0]
>>
>> In the metrics it looks something like this before dying:
>> https://ibb.co/kLhdXH
>>
>> What the heap dump looks like of the top objects: https://ibb.co/ctkyXH
>>
>> The load is usually pretty low, the nodes are almost idling (avg 500
>> reads/sec, 30-40 writes/sec with occasional few second spikes with >100
>> writes) and the pending tasks is also around 0 usually.
>>
>> Any ideas? I'm starting to run out of ideas. Maybe the secondary indexes
>> cause problems? I could finish some bigger compactions where there was no
>> index attached but I'm not sure 100% if this is the cause.
>>
>> Thanks,
>> Zsolt
>>
>>
>>
>>
>>
>
>


Re: OOM after a while during compacting

2018-04-05 Thread Evelyn Smith
Not sure if it differs for SASI Secondary Indexes but my understanding is it’s 
a bad idea to use high cardinality columns for Secondary Indexes. 
Not sure what your data model looks like but I’d assume UUID would have very 
high cardinality.

If that’s the case it pretty much guarantees any query on the secondary index 
will hit all the nodes, which is what you want to avoid.

Also Secondary Indexes are generally bad for Cassandra, if you don’t need them 
or there's a way around using them I’d go with that.

Regards,
Eevee.

> On 5 Apr 2018, at 11:27 pm, Zsolt Pálmai  wrote:
> 
> Tried both (although with the biggest table) and the result is the same. 
> 
> I stumbled upon this jira issue: 
> https://issues.apache.org/jira/browse/CASSANDRA-12662 
> 
> Since the SASI indexes I use are only helping in debugging (for now) I 
> dropped them and it seems the tables get compacted now (at least it made it 
> further than before and the JVM metrics look healthy). 
> 
> Still this is not ideal as it would be nice to have those secondary indexes 
> :/ . 
> 
> The columns I indexed are basically uuids (so I can match the rows from other 
> systems but this is usually triggered manually so performance loss is 
> acceptable). 
> Is there a recommended index to use here? Or setting the 
> max_compaction_flush_memory_in_mb value? I saw that it can cause different 
> kinds of problems... Or the default secondary index?
> 
> Thanks
> 
> 
> 
2018-04-05 15:14 GMT+02:00 Evelyn Smith :
> Probably a dumb question but it’s good to clarify.
> 
> Are you compacting the whole keyspace or are you compacting tables one at a 
> time?
> 
> 
>> On 5 Apr 2018, at 9:47 pm, Zsolt Pálmai  wrote:
>> 
>> Hi!
>> 
>> I have a setup with 4 AWS nodes (m4xlarge - 4 cpu, 16gb ram, 1TB ssd each) 
>> and when running the nodetool compact command on any of the servers I get 
>> out of memory exception after a while.
>> 
>> - Before calling the compact first I did a repair and before that there was 
>> a bigger update on a lot of entries so I guess a lot of sstables were 
>> created. The repair created around ~250 pending compaction tasks, 2 of the 
>> nodes I managed to finish with upgrading to a 2xlarge machine and twice the 
>> heap (but running the compact on them manually also killed one :/ so this 
>> isn't an ideal solution)
>> 
>> Some more info: 
>> - Version is the newest 3.11.2 with java8u116
>> - Using LeveledCompactionStrategy (we have mostly reads)
>> - Heap size is set to 8GB
>> - Using G1GC
>> - I tried moving the memtable out of the heap. It helped but I still got an 
>> OOM last night
>> - Concurrent compactors is set to 1 but it still happens and also tried 
>> setting throughput between 16 and 128, no changes.
>> - Storage load is 127Gb/140Gb/151Gb/155Gb
>> - 1 keyspace, 16 tables but there are a few SASI indexes on big tables.
>> - The biggest partition I found was 90Mb but that table has only 2 sstables 
>> attached and compacts in seconds. The rest is mostly 1 line partition with a 
>> few 10KB of data.
>> - Worst SSTable case: SSTables in each level: [1, 20/10, 106/100, 15, 0, 0, 
>> 0, 0, 0]
>> 
>> In the metrics it looks something like this before dying: 
>> https://ibb.co/kLhdXH 
>> 
>> What the heap dump looks like of the top objects: https://ibb.co/ctkyXH 
>> 
>> 
>> The load is usually pretty low, the nodes are almost idling (avg 500 
>> reads/sec, 30-40 writes/sec with occasional few second spikes with >100 
>> writes) and the pending tasks is also around 0 usually.
>> 
>> Any ideas? I'm starting to run out of ideas. Maybe the secondary indexes 
>> cause problems? I could finish some bigger compactions where there was no 
>> index attached but I'm not sure 100% if this is the cause.
>> 
>> Thanks,
>> Zsolt
>> 
>> 
>> 
> 
> 



Re: OOM after a while during compacting

2018-04-05 Thread Zsolt Pálmai
Tried both (although with the biggest table) and the result is the same.

I stumbled upon this jira issue:
https://issues.apache.org/jira/browse/CASSANDRA-12662
Since the SASI indexes I use are only helping in debugging (for now) I
dropped them and it seems the tables get compacted now (at least it made it
further than before and the JVM metrics look healthy).

Still this is not ideal as it would be nice to have those secondary indexes
:/ .

The columns I indexed are basically uuids (so I can match the rows from
other systems but this is usually triggered manually so performance loss is
acceptable).
Is there a recommended index to use here? Or setting
the max_compaction_flush_memory_in_mb value? I saw that it can cause
different kinds of problems... Or the default secondary index?

Thanks



2018-04-05 15:14 GMT+02:00 Evelyn Smith :

> Probably a dumb question but it’s good to clarify.
>
> Are you compacting the whole keyspace or are you compacting tables one at
> a time?
>
>
> On 5 Apr 2018, at 9:47 pm, Zsolt Pálmai  wrote:
>
> Hi!
>
> I have a setup with 4 AWS nodes (m4xlarge - 4 cpu, 16gb ram, 1TB ssd each)
> and when running the nodetool compact command on any of the servers I get
> out of memory exception after a while.
>
> - Before calling the compact first I did a repair and before that there
> was a bigger update on a lot of entries so I guess a lot of sstables were
> created. The repair created around ~250 pending compaction tasks, 2 of the
> nodes I managed to finish with upgrading to a 2xlarge machine and twice the
> heap (but running the compact on them manually also killed one :/ so this
> isn't an ideal solution)
>
> Some more info:
> - Version is the newest 3.11.2 with java8u116
> - Using LeveledCompactionStrategy (we have mostly reads)
> - Heap size is set to 8GB
> - Using G1GC
> - I tried moving the memtable out of the heap. It helped but I still got
> an OOM last night
> - Concurrent compactors is set to 1 but it still happens and also tried
> setting throughput between 16 and 128, no changes.
> - Storage load is 127Gb/140Gb/151Gb/155Gb
> - 1 keyspace, 16 tables but there are a few SASI indexes on big tables.
> - The biggest partition I found was 90Mb but that table has only 2
> sstables attached and compacts in seconds. The rest is mostly 1 line
> partition with a few 10KB of data.
> - Worst SSTable case: SSTables in each level: [1, 20/10, 106/100, 15, 0,
> 0, 0, 0, 0]
>
> In the metrics it looks something like this before dying:
> https://ibb.co/kLhdXH
>
> What the heap dump looks like of the top objects: https://ibb.co/ctkyXH
>
> The load is usually pretty low, the nodes are almost idling (avg 500
> reads/sec, 30-40 writes/sec with occasional few second spikes with >100
> writes) and the pending tasks is also around 0 usually.
>
> Any ideas? I'm starting to run out of ideas. Maybe the secondary indexes
> cause problems? I could finish some bigger compactions where there was no
> index attached but I'm not sure 100% if this is the cause.
>
> Thanks,
> Zsolt
>
>
>
>
>


Re: OOM after a while during compacting

2018-04-05 Thread Evelyn Smith
Oh and second, are you attempting a major compact while you have all those 
pending compactions?

Try letting the cluster catch up on compactions. Having that many pending is 
bad.

If you have replication factor of 3 and quorum you could go node by node and 
disable binary, raise concurrent compactors to 4 and unthrottle compactions by 
setting throughput to zero. This can help it catch up on those compactions. 
Then you can deal with trying a major compaction.
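
A sketch of that per-node sequence (command names assume a 3.x nodetool; on 
versions without setconcurrentcompactors, change concurrent_compactors in 
cassandra.yaml and restart instead):

    nodetool disablebinary               # stop serving native-protocol clients on this node
    nodetool setconcurrentcompactors 4   # more compactor threads while the node takes no traffic
    nodetool setcompactionthroughput 0   # 0 = unthrottled
    nodetool compactionstats -H          # watch the pending count drain
    nodetool enablebinary                # put the node back once it has caught up

With RF=3 and quorum this can be done one node at a time without losing 
availability, which is the point of the suggestion above.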

Regards,
Evelyn.

> On 5 Apr 2018, at 11:14 pm, Evelyn Smith  wrote:
> 
> Probably a dumb question but it’s good to clarify.
> 
> Are you compacting the whole keyspace or are you compacting tables one at a 
> time?
> 
>> On 5 Apr 2018, at 9:47 pm, Zsolt Pálmai  wrote:
>> 
>> Hi!
>> 
>> I have a setup with 4 AWS nodes (m4xlarge - 4 cpu, 16gb ram, 1TB ssd each) 
>> and when running the nodetool compact command on any of the servers I get 
>> out of memory exception after a while.
>> 
>> - Before calling the compact first I did a repair and before that there was 
>> a bigger update on a lot of entries so I guess a lot of sstables were 
>> created. The repair created around ~250 pending compaction tasks, 2 of the 
>> nodes I managed to finish with upgrading to a 2xlarge machine and twice the 
>> heap (but running the compact on them manually also killed one :/ so this 
>> isn't an ideal solution)
>> 
>> Some more info: 
>> - Version is the newest 3.11.2 with java8u116
>> - Using LeveledCompactionStrategy (we have mostly reads)
>> - Heap size is set to 8GB
>> - Using G1GC
>> - I tried moving the memtable out of the heap. It helped but I still got an 
>> OOM last night
>> - Concurrent compactors is set to 1 but it still happens and also tried 
>> setting throughput between 16 and 128, no changes.
>> - Storage load is 127Gb/140Gb/151Gb/155Gb
>> - 1 keyspace, 16 tables but there are a few SASI indexes on big tables.
>> - The biggest partition I found was 90Mb but that table has only 2 sstables 
>> attached and compacts in seconds. The rest is mostly 1 line partition with a 
>> few 10KB of data.
>> - Worst SSTable case: SSTables in each level: [1, 20/10, 106/100, 15, 0, 0, 
>> 0, 0, 0]
>> 
>> In the metrics it looks something like this before dying: 
>> https://ibb.co/kLhdXH 
>> 
>> What the heap dump looks like of the top objects: https://ibb.co/ctkyXH 
>> 
>> 
>> The load is usually pretty low, the nodes are almost idling (avg 500 
>> reads/sec, 30-40 writes/sec with occasional few second spikes with >100 
>> writes) and the pending tasks is also around 0 usually.
>> 
>> Any ideas? I'm starting to run out of ideas. Maybe the secondary indexes 
>> cause problems? I could finish some bigger compactions where there was no 
>> index attached but I'm not sure 100% if this is the cause.
>> 
>> Thanks,
>> Zsolt
>> 
>> 
>> 
> 



Re: OOM after a while during compacting

2018-04-05 Thread Evelyn Smith
Probably a dumb question but it’s good to clarify.

Are you compacting the whole keyspace or are you compacting tables one at a 
time?

> On 5 Apr 2018, at 9:47 pm, Zsolt Pálmai  wrote:
> 
> Hi!
> 
> I have a setup with 4 AWS nodes (m4xlarge - 4 cpu, 16gb ram, 1TB ssd each) 
> and when running the nodetool compact command on any of the servers I get out 
> of memory exception after a while.
> 
> - Before calling the compact first I did a repair and before that there was a 
> bigger update on a lot of entries so I guess a lot of sstables were created. 
> The repair created around ~250 pending compaction tasks, 2 of the nodes I 
> managed to finish with upgrading to a 2xlarge machine and twice the heap (but 
> running the compact on them manually also killed one :/ so this isn't an 
> ideal solution)
> 
> Some more info: 
> - Version is the newest 3.11.2 with java8u116
> - Using LeveledCompactionStrategy (we have mostly reads)
> - Heap size is set to 8GB
> - Using G1GC
> - I tried moving the memtable out of the heap. It helped but I still got an 
> OOM last night
> - Concurrent compactors is set to 1 but it still happens and also tried 
> setting throughput between 16 and 128, no changes.
> - Storage load is 127Gb/140Gb/151Gb/155Gb
> - 1 keyspace, 16 tables but there are a few SASI indexes on big tables.
> - The biggest partition I found was 90Mb but that table has only 2 sstables 
> attached and compacts in seconds. The rest is mostly 1 line partition with a 
> few 10KB of data.
> - Worst SSTable case: SSTables in each level: [1, 20/10, 106/100, 15, 0, 0, 
> 0, 0, 0]
> 
> In the metrics it looks something like this before dying: 
> https://ibb.co/kLhdXH 
> 
> What the heap dump looks like of the top objects: https://ibb.co/ctkyXH 
> 
> 
> The load is usually pretty low, the nodes are almost idling (avg 500 
> reads/sec, 30-40 writes/sec with occasional few second spikes with >100 
> writes) and the pending tasks is also around 0 usually.
> 
> Any ideas? I'm starting to run out of ideas. Maybe the secondary indexes 
> cause problems? I could finish some bigger compactions where there was no 
> index attached but I'm not sure 100% if this is the cause.
> 
> Thanks,
> Zsolt
> 
> 
> 



Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-07 Thread Shravan C
In fact I truncated the hints table to stabilize the cluster. Through the heap 
dumps I was able to identify the table on which there were numerous queries. 
Then I focused on the system_traces.sessions table around the time the OOM 
occurred. It turned out to be a full table scan on a large table which caused the OOM.
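
For anyone retracing this, a sketch of the kind of query involved 
(system_traces.sessions only holds rows for requests that ran while tracing was 
enabled; column names assume the stock 2.1 schema):

    SELECT session_id, coordinator, started_at, duration, request, parameters
    FROM system_traces.sessions
    LIMIT 100;

started_at is not part of the primary key, so narrowing to the window around the 
OOM means either ALLOW FILTERING on versions that permit it for non-indexed 
columns, or filtering client-side; the parameters map typically carries the CQL 
text, which is how a full-table-scan query shows up.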


Thanks to every one of you.

From: Jeff Jirsa <jji...@apache.org>
Sent: Tuesday, March 7, 2017 1:19 PM
To: user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time



On 2017-03-03 09:18 (-0800), Shravan Ch <chall...@outlook.com> wrote:
>
> nodetool compactionstats -H
> pending tasks: 3
> compaction type    keyspace    table    completed    total       unit     progress
> Compaction         system      hints    28.5 GB      92.38 GB    bytes    30.85%
>
>

The hint buildup is also something that could have caused OOMs. Hints are 
stored for a given host in a single partition, which means it's common for a 
single row/partition to get huge if you have a single host flapping.

If you see "Compacting large row" messages for the hint rows, I suspect you'll 
find that one of the hosts/rows is responsible for most of that 92GB of hints, 
which means when you try to deliver the hints, you'll read from a huge 
partition, which creates memory pressure (see: CASSANDRA-9754) leading to GC 
pauses (or ooms), which then causes you to flap, which causes you to create 
more hints, which causes an ugly spiral.

In 3.0, hints were rewritten to avoid this problem, but short term, you may 
need to truncate your hints to get healthy (assuming it's safe for you to do 
so, where 'safe' is based on your read+write consistency levels).
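
For completeness, a sketch of the truncation step on a 2.x cluster, where hints 
live in a system table (only once you have judged it safe for your consistency 
levels, per the caveat above):

    nodetool truncatehints      # per node: drops all hints stored on that node
    # or equivalently from cqlsh on 2.x:
    TRUNCATE system.hints;

Either way the undelivered updates are gone for good, so a repair afterwards is 
the usual follow-up.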




Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-07 Thread Jeff Jirsa


On 2017-03-03 09:18 (-0800), Shravan Ch  wrote: 
> 
> nodetool compactionstats -H
> pending tasks: 3
> compaction type    keyspace    table    completed    total       unit     progress
> Compaction         system      hints    28.5 GB      92.38 GB    bytes    30.85%
> 
> 

The hint buildup is also something that could have caused OOMs. Hints are 
stored for a given host in a single partition, which means it's common for a 
single row/partition to get huge if you have a single host flapping.

If you see "Compacting large row" messages for the hint rows, I suspect you'll 
find that one of the hosts/rows is responsible for most of that 92GB of hints, 
which means when you try to deliver the hints, you'll read from a huge 
partition, which creates memory pressure (see: CASSANDRA-9754) leading to GC 
pauses (or ooms), which then causes you to flap, which causes you to create 
more hints, which causes an ugly spiral.

In 3.0, hints were rewritten to avoid this problem, but short term, you may 
need to truncate your hints to get healthy (assuming it's safe for you to do 
so, where 'safe' is based on your read+write consistency levels).




Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-07 Thread Jeff Jirsa


On 2017-03-04 07:23 (-0800), "Thakrar, Jayesh"  
wrote: 
> LCS does not rule out frequent updates - it just says that there will be more 
> frequent compaction, which can potentially increase compaction activity 
> (which again can be throttled as needed).
> But STCS will guarantee OOM when you have large datasets.
> Did you have a look at the offheap + onheap size of your JVM using "nodetool 
> info"?
> 
> 

STCS does not guarantee you OOM when you have large datasets, unless by large 
datasets you mean in the tens-of-terabytes range, which is already something we 
typically recommend against.




Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-06 Thread Eric Evans
On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch  wrote:
> More than 30 Cassandra servers in the primary DC went down with the OOM
> exception below. What puzzles me is the scale at which it happened (at the
> same minute). I will share some more details below.

You'd be surprised; when it's the result of aberrant data/workload,
then having many nodes OOM at once is more common than you might
think.

> System Log: http://pastebin.com/iPeYrWVR

The traceback shows the OOM occurring during a read (a slice), not a
write.  What do your data model and queries look like?  Do you do
deletes (TTLs maybe)? Did the OOM result in a heap dump?

> GC Log: http://pastebin.com/CzNNGs0r
>
> During the OOM I saw a lot of WARNings like the below (these were there for
> quite some time, maybe weeks)
> WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252
> - Batch of prepared statements for [keyspace.table] is of size 225455,
> exceeding specified threshold of 65536 by 159919.
>
> Environment:
> We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more C*
> nodes on SSD and apps run here)  and secondary DC (geographically remote and
> more like a DR to primary) on SAS drives.
> Cassandra config:
>
> Java 1.8.0_65
> Garbage Collector: G1GC
> memtable_allocation_type: offheap_objects
>
> Post this OOM I am seeing huge hints pile up on majority of the nodes and
> the pending hints keep going up. I have increased HintedHandoff CoreThreads
> to 6 but that did not help (I admit that I tried this on one node to try).
>
> nodetool compactionstats -H
> pending tasks: 3
> compaction type    keyspace    table    completed    total       unit     progress
> Compaction         system      hints    28.5 GB      92.38 GB    bytes    30.85%



-- 
Eric Evans
john.eric.ev...@gmail.com


Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-04 Thread Thakrar, Jayesh
If possible, I would suggest running that command on a periodic basis (cron or 
whatever).
Also, you can run it on a single server and iterate through all the nodes in 
the cluster/DC.
Would also recommend running "nodetool compactionstats".

And I looked at your concern about the high value for hinted handoff.
That's good (in a way); it ensures that updates are not lost.
It's possible because your DB was constantly being updated, so the servers that 
came up started accumulating hints for the servers that were still down.
Furthermore, that may also have been the situation as the servers were going 
down.
Hence a high hinted-handoff count is just a sign of pending updates that need to be 
applied, which is not uncommon if you had servers falling down/restarting like 
dominoes and updates still coming in.

From: Shravan C <chall...@outlook.com>
Date: Saturday, March 4, 2017 at 11:15 AM
To: Conversant <jthak...@conversantmedia.com>, Joaquin Casares 
<joaq...@thelastpickle.com>, "user@cassandra.apache.org" 
<user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


I was looking at nodetool info across all nodes. Consistently JVM heap used is 
~ 12GB and off heap is ~ 4-5GB.


From: Thakrar, Jayesh <jthak...@conversantmedia.com>
Sent: Saturday, March 4, 2017 9:23:01 AM
To: Shravan C; Joaquin Casares; user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time

LCS does not rule out frequent updates - it just says that there will be more 
frequent compaction, which can potentially increase compaction activity (which 
again can be throttled as needed).
But STCS will guarantee OOM when you have large datasets.
Did you have a look at the offheap + onheap size of your JVM using "nodetool 
info"?


From: Shravan C <chall...@outlook.com>
Date: Friday, March 3, 2017 at 11:11 PM
To: Joaquin Casares <joaq...@thelastpickle.com>, "user@cassandra.apache.org" 
<user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


We run C* at 32 GB and all servers have 96GB RAM. We use STCS . LCS is not an 
option for us as we have frequent updates.


Thanks,
Shravan

From: Thakrar, Jayesh <jthak...@conversantmedia.com>
Sent: Friday, March 3, 2017 3:47:27 PM
To: Joaquin Casares; user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


Had been fighting a similar battle, but am now over the hump for most part.



Get info on the server config (e.g. memory, cpu, free memory (free -g), etc)

Run "nodetool info" on the nodes to get heap and off-heap sizes

Run "nodetool tablestats" or "nodetool tablestats <keyspace>.<table>" on the 
key large tables

Essentially the purpose is to see if you really had a true OOM or was your 
machine running out of memory.



Cassandra can use offheap memory very well - so "nodetool info" will give you 
both heap and offheap.
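
Spelled out, a sketch of those checks (assuming a version whose nodetool has 
"tablestats"; on 2.1 the equivalent command is "nodetool cfstats"):

    free -g                                     # is the whole box short on memory, or only the JVM heap?
    nodetool info | grep -i heap                # reports both "Heap Memory (MB)" and "Off Heap Memory (MB)"
    nodetool tablestats <keyspace>.<table> -H   # per-table memtable, bloom filter and off-heap usage, partition sizes

Comparing the JVM's heap + off-heap total against what the box actually has free 
is what tells you whether this was a true JVM OOM or the machine itself running 
out of memory.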



Also, what is the compaction strategy of your tables?



Personally, I have found STCS to be awful at large scale - when you have 
sstables that are 100+ GB in size.

See 
https://issues.apache.org/jira/browse/CASSANDRA-10821?focusedCommentId=15389451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15389451



LCS seems better and should be the default (my opinion) unless you want DTCS



A good description of all three compactions is here - 
http://docs.scylladb.com/kb/compaction/








From: Joaquin Casares <joaq...@thelastpickle.com>
Date: Friday, March 3, 2017 at 11:34 AM
To: <user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time



Hello Shravan,



Typically asynchronous requests are recommended over batch statements since 
batch statements will cause more work on the coordinator node while individual 
requests, when using a TokenAwarePolicy, will hit a specific coordinator, 
perform a local disk seek, and return the requested information.



The only times that using batch statements are ideal is if writing to the same 
partition key, even if it's across multiple tables when using the same hashing 
algorithm (like murmur3).
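
To make the pattern concrete, a minimal sketch using the DataStax Java driver 
3.x API of that era (the keyspace, table and contact point are placeholders, not 
taken from this thread):

    import com.datastax.driver.core.*;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;
    import java.util.ArrayList;
    import java.util.List;

    public class AsyncWritesInsteadOfBatch {
        public static void main(String[] args) {
            // Token-aware routing on top of DC-aware round-robin:
            // each request is sent straight to a replica for its partition key.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")   // placeholder contact point
                    .withLoadBalancingPolicy(
                            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                    .build();
            Session session = cluster.connect();

            // Placeholder schema: ks.tbl (pk text PRIMARY KEY, col text)
            PreparedStatement ps = session.prepare("INSERT INTO ks.tbl (pk, col) VALUES (?, ?)");

            // Individual async writes instead of one large multi-partition batch.
            List<ResultSetFuture> futures = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                futures.add(session.executeAsync(ps.bind("key-" + i, "value-" + i)));
            }
            for (ResultSetFuture f : futures) {
                f.getUninterruptibly();            // wait for every write to complete
            }
            cluster.close();
        }
    }

A logged or unlogged BATCH remains the right tool when all statements share the 
same partition key, per the paragraph above; the async pattern is for the 
multi-partition case.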



Could you provide a bit of insight into what the batch statement was trying to 
accomplish and how many child statements were bundled up within that batch?



Cheers,



Joaquin


Joaquin Casares

Consultant

Austin, TX



Apache Cassandra Consulting

http://www.thelastpickle.com

Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-04 Thread Priyanka


Sent from my iPhone

> On Mar 3, 2017, at 12:18 PM, Shravan Ch  wrote:
> 
> Hello,
> 
> More than 30 Cassandra servers in the primary DC went down with the OOM exception 
> below. What puzzles me is the scale at which it happened (at the same 
> minute). I will share some more details below. 
> 
> System Log: http://pastebin.com/iPeYrWVR
> GC Log: http://pastebin.com/CzNNGs0r
> 
> During the OOM I saw a lot of WARNings like the below (these were there for 
> quite some time, maybe weeks)
> WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252 
> - Batch of prepared statements for [keyspace.table] is of size 225455, 
> exceeding specified threshold of 65536 by 159919.
> 
> Environment:
> We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more C* 
> nodes on SSD and apps run here)  and secondary DC (geographically remote and 
> more like a DR to primary) on SAS drives. 
> Cassandra config:
> 
> Java 1.8.0_65
> Garbage Collector: G1GC
> memtable_allocation_type: offheap_objects
> 
> Post this OOM I am seeing huge hints pile up on majority of the nodes and the 
> pending hints keep going up. I have increased HintedHandoff CoreThreads to 6 
> but that did not help (I admit that I tried this on one node to try).
> 
> nodetool compactionstats -H
> pending tasks: 3
> compaction type    keyspace    table    completed    total       unit     progress
> Compaction         system      hints    28.5 GB      92.38 GB    bytes    30.85%
> 
> 
> Appreciate your inputs here. 
> 
> Thanks,
> Shravan


Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-04 Thread Shravan C
I was looking at nodetool info across all nodes. Consistently JVM heap used is 
~ 12GB and off heap is ~ 4-5GB.


From: Thakrar, Jayesh <jthak...@conversantmedia.com>
Sent: Saturday, March 4, 2017 9:23:01 AM
To: Shravan C; Joaquin Casares; user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time

LCS does not rule out frequent updates - it just says that there will be more 
frequent compaction, which can potentially increase compaction activity (which 
again can be throttled as needed).
But STCS will guarantee OOM when you have large datasets.
Did you have a look at the offheap + onheap size of your JVM using "nodetool 
info"?


From: Shravan C <chall...@outlook.com>
Date: Friday, March 3, 2017 at 11:11 PM
To: Joaquin Casares <joaq...@thelastpickle.com>, "user@cassandra.apache.org" 
<user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


We run C* at 32 GB and all servers have 96GB RAM. We use STCS . LCS is not an 
option for us as we have frequent updates.


Thanks,
Shravan

From: Thakrar, Jayesh <jthak...@conversantmedia.com>
Sent: Friday, March 3, 2017 3:47:27 PM
To: Joaquin Casares; user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


Had been fighting a similar battle, but am now over the hump for most part.



Get info on the server config (e.g. memory, cpu, free memory (free -g), etc)

Run "nodetool info" on the nodes to get heap and off-heap sizes

Run "nodetool tablestats" or "nodetool tablestats <keyspace>.<table>" on the 
key large tables

Essentially the purpose is to see if you really had a true OOM or was your 
machine running out of memory.



Cassandra can use offheap memory very well - so "nodetool info" will give you 
both heap and offheap.



Also, what is the compaction strategy of your tables?



Personally, I have found STCS to be awful at large scale - when you have 
sstables that are 100+ GB in size.

See 
https://issues.apache.org/jira/browse/CASSANDRA-10821?focusedCommentId=15389451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15389451



LCS seems better and should be the default (my opinion) unless you want DTCS



A good description of all three compactions is here - 
http://docs.scylladb.com/kb/compaction/








From: Joaquin Casares <joaq...@thelastpickle.com>
Date: Friday, March 3, 2017 at 11:34 AM
To: <user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time



Hello Shravan,



Typically asynchronous requests are recommended over batch statements since 
batch statements will cause more work on the coordinator node while individual 
requests, when using a TokenAwarePolicy, will hit a specific coordinator, 
perform a local disk seek, and return the requested information.



The only times that using batch statements are ideal is if writing to the same 
partition key, even if it's across multiple tables when using the same hashing 
algorithm (like murmur3).



Could you provide a bit of insight into what the batch statement was trying to 
accomplish and how many child statements were bundled up within that batch?



Cheers,



Joaquin


Joaquin Casares

Consultant

Austin, TX



Apache Cassandra Consulting

http://www.thelastpickle.com




On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch <chall...@outlook.com> wrote:

Hello,

More than 30 Cassandra servers in the primary DC went down with the OOM exception 
below. What puzzles me is the scale at which it happened (all within the same 
minute). I will share some more details below.

System Log: http://pastebin.com/iPeYrWVR

GC Log: http://pastebin.com/CzNNGs0r

During the OOM I saw a lot of WARNings like the one below (these had been there 
for quite some time, maybe weeks):
WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252 - 
Batch of prepared statements for [keyspace.table] is of size 225455, exceeding 
specified threshold of 65536 by 159919.
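That 65536-byte figure corresponds to batch_size_warn_threshold_in_kb: 64 in cassandra.yaml. A rough sketch for checking the setting and gauging how often the warning fires; the config and log paths are the usual package-install defaults, and the fail threshold only exists on releases newer than 2.1:

  grep -nE 'batch_size_(warn|fail)_threshold_in_kb' /etc/cassandra/cassandra.yaml
  grep -c 'Batch of prepared statements' /var/log/cassandra/system.log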

Environment:
We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more C* 
nodes on SSD and apps run here)  and secondary DC (geographically remote and 
more like a DR to primary) on SAS drives.
Cassandra config:

Java 1.8.0_65
Garbage Collector: G1GC
memtable_allocation_type: offheap_objects

Post this OOM I am se

Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-04 Thread Edward Capriolo
On Saturday, March 4, 2017, Thakrar, Jayesh <jthak...@conversantmedia.com>
wrote:

> LCS does not rule out frequent updates - it just says that there will be
> more frequent compaction, which can potentially increase compaction
> activity (which again can be throttled as needed).
>
> But STCS will guarantee OOM when you have large datasets.
>
> Did you have a look at the offheap + onheap size of your JVM using
> "nodetool info"?
>
>
>
>
>
> *From: *Shravan C <chall...@outlook.com>
> *Date: *Friday, March 3, 2017 at 11:11 PM
> *To: *Joaquin Casares <joaq...@thelastpickle.com>, "user@cassandra.apache.org"
> <user@cassandra.apache.org>
> *Subject: *Re: OOM on Apache Cassandra on 30 Plus node at the same time
>
>
>
> We run C* at 32 GB and all servers have 96GB RAM. We use STCS . LCS is not
> an option for us as we have frequent updates.
>
>
>
> Thanks,
>
> Shravan
> --
>
> *From:* Thakrar, Jayesh <jthak...@conversantmedia.com>
> *Sent:* Friday, March 3, 2017 3:47:27 PM
> *To:* Joaquin Casares; user@cassandra.apache.org
> *Subject:* Re: OOM on Apache Cassandra on 30 Plus node at the same time
>
>
>
> Had been fighting a similar battle, but am now over the hump for most part.
>
>
>
> Get info on the server config (e.g. memory, cpu, free memory (free -g),
> etc)
>
> Run "nodetool info" on the nodes to get heap and off-heap sizes
>
> Run "nodetool tablestats" or "nodetool tablestats ."
> on the key large tables
>
> Essentially the purpose is to see if you really had a true OOM or was your
> machine running out of memory.
>
>
>
> Cassandra can use offheap memory very well - so "nodetool info" will give
> you both heap and offheap.
>
>
>
> Also, what is the compaction strategy of your tables?
>
>
>
> Personally, I have found STCS to be awful at large scale - when you have
> sstables that are 100+ GB in size.
>
> See https://issues.apache.org/jira/browse/CASSANDRA-10821?focusedCommentId=15389451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15389451
>
>
>
> LCS seems better and should be the default (my opinion) unless you want
> DTCS
>
>
>
> A good description of all three compactions is here -
> http://docs.scylladb.com/kb/compaction/
>
>
>
>
>
>
>
>
> *From: *Joaquin Casares <joaq...@thelastpickle.com>
> *Date: *Friday, March 3, 2017 at 11:34 AM
> *To: *<user@cassandra.apache.org>
> *Subject: *Re: OOM on Apache Cassandra on 30 Plus node at the same time
>
>
>
> Hello Shravan,
>
>
>
> Typically asynchronous requests are recommended over batch statements
> since batch statements will cause more work on the coordinator node while
> individual requests, when using a TokenAwarePolicy, will hit a specific
> coordinator, perform a local disk seek, and return the requested
> information.
>
>
>
> The only times that using batch statements are ideal is if writing to the
> same partition key, even if it's across multiple tables when using the same
> hashing algorithm (like murmur3).
>
>
>
> Could you provide a bit of insight into what the batch statement was
> trying to accomplish and how many child statements were bundled up within
> that batch?
>
>
>
> Cheers,
>
>
>
> Joaquin
>
>
> Joaquin Casares
>
> Consultant
>
> Austin, TX
>
>
>
> Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
>
>
>
> On Fri, Mar 3, 2017 at 11:18 AM, Shrava

Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-04 Thread Thakrar, Jayesh
LCS does not rule out frequent updates - it just says that there will be more 
frequent compaction, which can potentially increase compaction activity (which 
again can be throttled as needed).
But STCS will guarantee OOM when you have large datasets.
Did you have a look at the offheap + onheap size of your JVM using "nodetool 
info"?


From: Shravan C <chall...@outlook.com>
Date: Friday, March 3, 2017 at 11:11 PM
To: Joaquin Casares <joaq...@thelastpickle.com>, "user@cassandra.apache.org" 
<user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


We run C* at 32 GB and all servers have 96GB RAM. We use STCS . LCS is not an 
option for us as we have frequent updates.


Thanks,
Shravan

From: Thakrar, Jayesh <jthak...@conversantmedia.com>
Sent: Friday, March 3, 2017 3:47:27 PM
To: Joaquin Casares; user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


Had been fighting a similar battle, but am now over the hump for most part.



Get info on the server config (e.g. memory, cpu, free memory (free -g), etc)

Run "nodetool info" on the nodes to get heap and off-heap sizes

Run "nodetool tablestats" or "nodetool tablestats ." on the 
key large tables

Essentially the purpose is to see if you really had a true OOM or was your 
machine running out of memory.



Cassandra can use offheap memory very well - so "nodetool info" will give you 
both heap and offheap.



Also, what is the compaction strategy of your tables?



Personally, I have found STCS to be awful at large scale - when you have 
sstables that are 100+ GB in size.

See 
https://issues.apache.org/jira/browse/CASSANDRA-10821?focusedCommentId=15389451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15389451



LCS seems better and should be the default (my opinion) unless you want DTCS



A good description of all three compactions is here - 
http://docs.scylladb.com/kb/compaction/








From: Joaquin Casares <joaq...@thelastpickle.com>
Date: Friday, March 3, 2017 at 11:34 AM
To: <user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time



Hello Shravan,



Typically asynchronous requests are recommended over batch statements since 
batch statements will cause more work on the coordinator node while individual 
requests, when using a TokenAwarePolicy, will hit a specific coordinator, 
perform a local disk seek, and return the requested information.



The only times that using batch statements are ideal is if writing to the same 
partition key, even if it's across multiple tables when using the same hashing 
algorithm (like murmur3).



Could you provide a bit of insight into what the batch statement was trying to 
accomplish and how many child statements were bundled up within that batch?



Cheers,



Joaquin


Joaquin Casares

Consultant

Austin, TX



Apache Cassandra Consulting

http://www.thelastpickle.com




On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch <chall...@outlook.com> wrote:

Hello,

More than 30 Cassandra servers in the primary DC went down with the OOM exception 
below. What puzzles me is the scale at which it happened (all within the same 
minute). I will share some more details below.

System Log: http://pastebin.com/iPeYrWVR

GC Log: http://pastebin.com/CzNNGs0r

During the OOM I saw a lot of WARNings like the one below (these had been there 
for quite some time, maybe weeks):
WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252 - 
Batch of prepared statements for [keyspace.table] is of size 225455, exceeding 
specified threshold of 65536 by 159919.

Environment:
We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more C* 
nodes on SSD and apps run here)  and secondary DC (geographically remote and 
more like a DR to primary) on SAS drives.
Cassandra config:

Java 1.8.0_65
Garbage Collector: G1GC
memtable_allocation_type: offheap_objects

Post this OOM I am seeing huge hint pile-ups on the majority of the nodes, and the 
pending hints keep going up. I have increased HintedHandoff CoreThreads to 6, but 
that did not help (admittedly I only tried this on one node).

nodetool compactionstats -H
pending tasks: 3
   compaction type   keyspace   table   completed      total    unit   progress
        Compaction     system   hints     28.5 GB   92.38 GB   bytes     30.85%
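As a hedged sketch of ways to inspect and relieve a hint backlog like the one above (the yaml keys are from the 2.1-era config, and the path is an assumption; truncating hints discards them, so a repair would be needed afterwards):

  nodetool tpstats | grep -i hint     # HintedHandoff pending/active tasks
  nodetool compactionstats            # progress of the system.hints compaction
  grep -E 'max_hints_delivery_threads|hinted_handoff_throttle_in_kb' /etc/cassandra/cassandra.yaml
  # Last resort, discards the backlog entirely -- follow up with a repair:
  # nodetool truncatehints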

Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-03 Thread Shravan C
We run C* at 32 GB and all servers have 96GB RAM. We use STCS . LCS is not an 
option for us as we have frequent updates.


Thanks,
Shravan

From: Thakrar, Jayesh <jthak...@conversantmedia.com>
Sent: Friday, March 3, 2017 3:47:27 PM
To: Joaquin Casares; user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time


Had been fighting a similar battle, but am now over the hump for most part.



Get info on the server config (e.g. memory, cpu, free memory (free -g), etc)

Run "nodetool info" on the nodes to get heap and off-heap sizes

Run "nodetool tablestats" or "nodetool tablestats ." on the 
key large tables

Essentially the purpose is to see if you really had a true OOM or was your 
machine running out of memory.



Cassandra can use offheap memory very well - so "nodetool info" will give you 
both heap and offheap.



Also, what is the compaction strategy of your tables?



Personally, I have found STCS to be awful at large scale - when you have 
sstables that are 100+ GB in size.

See 
https://issues.apache.org/jira/browse/CASSANDRA-10821?focusedCommentId=15389451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15389451



LCS seems better and should be the default (my opinion) unless you want DTCS



A good description of all three compactions is here - 
http://docs.scylladb.com/kb/compaction/









From: Joaquin Casares <joaq...@thelastpickle.com>
Date: Friday, March 3, 2017 at 11:34 AM
To: <user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time



Hello Shravan,



Typically asynchronous requests are recommended over batch statements since 
batch statements will cause more work on the coordinator node while individual 
requests, when using a TokenAwarePolicy, will hit a specific coordinator, 
perform a local disk seek, and return the requested information.



The only times that using batch statements are ideal is if writing to the same 
partition key, even if it's across multiple tables when using the same hashing 
algorithm (like murmur3).



Could you provide a bit of insight into what the batch statement was trying to 
accomplish and how many child statements were bundled up within that batch?



Cheers,



Joaquin


Joaquin Casares

Consultant

Austin, TX



Apache Cassandra Consulting

http://www.thelastpickle.com





On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch <chall...@outlook.com> wrote:

Hello,

More than 30 Cassandra servers in the primary DC went down with the OOM exception 
below. What puzzles me is the scale at which it happened (all within the same 
minute). I will share some more details below.

System Log: http://pastebin.com/iPeYrWVR

GC Log: http://pastebin.com/CzNNGs0r

During the OOM I saw a lot of WARNings like the one below (these had been there 
for quite some time, maybe weeks):
WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252 - 
Batch of prepared statements for [keyspace.table] is of size 225455, exceeding 
specified threshold of 65536 by 159919.

Environment:
We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more C* 
nodes on SSD and apps run here)  and secondary DC (geographically remote and 
more like a DR to primary) on SAS drives.
Cassandra config:

Java 1.8.0_65
Garbage Collector: G1GC
memtable_allocation_type: offheap_objects

Post this OOM I am seeing huge hint pile-ups on the majority of the nodes, and the 
pending hints keep going up. I have increased HintedHandoff CoreThreads to 6, but 
that did not help (admittedly I only tried this on one node).

nodetool compactionstats -H
pending tasks: 3
   compaction type   keyspace   table   completed      total    unit   progress
        Compaction     system   hints     28.5 GB   92.38 GB   bytes     30.85%


Appreciate your inputs here.

Thanks,

Shravan




Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-03 Thread Thakrar, Jayesh
Had been fighting a similar battle, but am now over the hump for most part.

Get info on the server config (e.g. memory, cpu, free memory (free -g), etc)
Run "nodetool info" on the nodes to get heap and off-heap sizes
Run "nodetool tablestats" or "nodetool tablestats ." on the 
key large tables
Essentially the purpose is to see if you really had a true OOM or was your 
machine running out of memory.

Cassandra can use offheap memory very well - so "nodetool info" will give you 
both heap and offheap.

Also, what is the compaction strategy of your tables?

Personally, I have found STCS to be awful at large scale - when you have 
sstables that are 100+ GB in size.
See 
https://issues.apache.org/jira/browse/CASSANDRA-10821?focusedCommentId=15389451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15389451

LCS seems better and should be the default (my opinion) unless you want DTCS

A good description of all three compactions is here - 
http://docs.scylladb.com/kb/compaction/



From: Joaquin Casares <joaq...@thelastpickle.com>
Date: Friday, March 3, 2017 at 11:34 AM
To: <user@cassandra.apache.org>
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time

Hello Shravan,

Typically asynchronous requests are recommended over batch statements since 
batch statements will cause more work on the coordinator node while individual 
requests, when using a TokenAwarePolicy, will hit a specific coordinator, 
perform a local disk seek, and return the requested information.

The only times that using batch statements are ideal is if writing to the same 
partition key, even if it's across multiple tables when using the same hashing 
algorithm (like murmur3).

Could you provide a bit of insight into what the batch statement was trying to 
accomplish and how many child statements were bundled up within that batch?

Cheers,

Joaquin

Joaquin Casares
Consultant
Austin, TX

Apache Cassandra Consulting
http://www.thelastpickle.com

On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch <chall...@outlook.com> wrote:
Hello,

More than 30 Cassandra servers in the primary DC went down with the OOM exception 
below. What puzzles me is the scale at which it happened (all within the same 
minute). I will share some more details below.
System Log: http://pastebin.com/iPeYrWVR
GC Log: http://pastebin.com/CzNNGs0r

During the OOM I saw a lot of WARNings like the one below (these had been there 
for quite some time, maybe weeks):
WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252 - 
Batch of prepared statements for [keyspace.table] is of size 225455, exceeding 
specified threshold of 65536 by 159919.

Environment:
We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more C* 
nodes on SSD and apps run here)  and secondary DC (geographically remote and 
more like a DR to primary) on SAS drives.
Cassandra config:

Java 1.8.0_65
Garbage Collector: G1GC
memtable_allocation_type: offheap_objects

Post this OOM I am seeing huge hint pile-ups on the majority of the nodes, and the 
pending hints keep going up. I have increased HintedHandoff CoreThreads to 6, but 
that did not help (admittedly I only tried this on one node).

nodetool compactionstats -H
pending tasks: 3
   compaction type   keyspace   table   completed      total    unit   progress
        Compaction     system   hints     28.5 GB   92.38 GB   bytes     30.85%


Appreciate your inputs here.
Thanks,
Shravan



Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-03 Thread Shravan C
Hi Joaquin,


We have inserts going into a tracking table. The tracking table is a simple table 
[PRIMARY KEY (comid, status_timestamp)] with a few tracking attributes, sorted by 
status_timestamp. From a volume perspective it is not a whole lot.
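For concreteness, a sketch of what such a tracking table might look like; the keyspace, the extra column and its type are placeholders inferred from the description above:

  cqlsh -e "
  CREATE TABLE IF NOT EXISTS my_keyspace.tracking (
      comid            text,
      status_timestamp timestamp,
      status           text,   -- placeholder tracking attribute
      PRIMARY KEY (comid, status_timestamp)
  ) WITH CLUSTERING ORDER BY (status_timestamp ASC);"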


Thanks,
Shravan

From: Joaquin Casares <joaq...@thelastpickle.com>
Sent: Friday, March 3, 2017 11:34:58 AM
To: user@cassandra.apache.org
Subject: Re: OOM on Apache Cassandra on 30 Plus node at the same time

Hello Shravan,

Typically asynchronous requests are recommended over batch statements since 
batch statements will cause more work on the coordinator node while individual 
requests, when using a TokenAwarePolicy, will hit a specific coordinator, 
perform a local disk seek, and return the requested information.

The only times that using batch statements are ideal is if writing to the same 
partition key, even if it's across multiple tables when using the same hashing 
algorithm (like murmur3).

Could you provide a bit of insight into what the batch statement was trying to 
accomplish and how many child statements were bundled up within that batch?

Cheers,

Joaquin

Joaquin Casares
Consultant
Austin, TX

Apache Cassandra Consulting
http://www.thelastpickle.com

On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch <chall...@outlook.com> wrote:
Hello,

More than 30 Cassandra servers in the primary DC went down with the OOM exception 
below. What puzzles me is the scale at which it happened (all within the same 
minute). I will share some more details below.

System Log: http://pastebin.com/iPeYrWVR
GC Log: http://pastebin.com/CzNNGs0r

During the OOM I saw a lot of WARNings like the one below (these had been there 
for quite some time, maybe weeks):
WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252 - 
Batch of prepared statements for [keyspace.table] is of size 225455, exceeding 
specified threshold of 65536 by 159919.

Environment:
We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more C* 
nodes on SSD and apps run here)  and secondary DC (geographically remote and 
more like a DR to primary) on SAS drives.
Cassandra config:

Java 1.8.0_65
Garbage Collector: G1GC
memtable_allocation_type: offheap_objects

Post this OOM I am seeing huge hint pile-ups on the majority of the nodes, and the 
pending hints keep going up. I have increased HintedHandoff CoreThreads to 6, but 
that did not help (admittedly I only tried this on one node).

nodetool compactionstats -H
pending tasks: 3
   compaction type   keyspace   table   completed      total    unit   progress
        Compaction     system   hints     28.5 GB   92.38 GB   bytes     30.85%


Appreciate your inputs here.

Thanks,
Shravan



Re: OOM on Apache Cassandra on 30 Plus node at the same time

2017-03-03 Thread Joaquin Casares
Hello Shravan,

Typically asynchronous requests are recommended over batch statements since
batch statements will cause more work on the coordinator node while
individual requests, when using a TokenAwarePolicy, will hit a specific
coordinator, perform a local disk seek, and return the requested
information.

The only times that using batch statements are ideal is if writing to the
same partition key, even if it's across multiple tables when using the same
hashing algorithm (like murmur3).

Could you provide a bit of insight into what the batch statement was trying
to accomplish and how many child statements were bundled up within that
batch?

Cheers,

Joaquin

Joaquin Casares
Consultant
Austin, TX

Apache Cassandra Consulting
http://www.thelastpickle.com

On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch  wrote:

> Hello,
>
> More than 30 Cassandra servers in the primary DC went down with the OOM
> exception below. What puzzles me is the scale at which it happened (all
> within the same minute). I will share some more details below.
>
> System Log: http://pastebin.com/iPeYrWVR
> GC Log: http://pastebin.com/CzNNGs0r
>
> During the OOM I saw a lot of WARNings like
> the one below (these had been there for quite some time, maybe weeks):
> *WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209
> BatchStatement.java:252 - Batch of prepared statements for [keyspace.table]
> is of size 225455, exceeding specified threshold of 65536 by 159919.*
>
> * Environment:*
> We are using ApacheCassandra-2.1.9 on Multi DC cluster. Primary DC (more
> C* nodes on SSD and apps run here)  and secondary DC (geographically remote
> and more like a DR to primary) on SAS drives.
> Cassandra config:
>
> Java 1.8.0_65
> Garbage Collector: G1GC
> memtable_allocation_type: offheap_objects
>
> Post this OOM I am seeing huge hint pile-ups on the majority of the nodes, and
> the pending hints keep going up. I have increased HintedHandoff CoreThreads
> to 6, but that did not help (admittedly I only tried this on one node).
>
> nodetool compactionstats -H
> pending tasks: 3
>    compaction type   keyspace   table   completed      total    unit   progress
>         Compaction     system   hints     28.5 GB   92.38 GB   bytes     30.85%
>
>
> Appreciate your inputs here.
>
> Thanks,
> Shravan
>


Re: OOM under high write throughputs on 2.2.5

2016-05-24 Thread Bryan Cheng
Hi Zhiyan,

Silly question but are you sure your heap settings are actually being
applied?  "697,236,904 (51.91%)" would represent a sub-2GB heap. What's the
real memory usage for Java when this crash happens?

Another thing to look into might be memtable_heap_space_in_mb, as it looks
like you're using on-heap memtables. This will be 1/4 of your heap by
default. Assuming your heap settings are actually being applied, if you run
through this space you may not have enough flushing resources.
memtable_flush_writers defaults to a somewhat low number which may not be
enough for this use case.
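A hedged sketch for double-checking both points on Linux (the pgrep pattern and config path are assumptions):

  pid=$(pgrep -f CassandraDaemon)

  # What -Xms/-Xmx were actually passed to the running JVM?
  tr '\0' '\n' < /proc/$pid/cmdline | grep -E '^-Xm[sx]'

  # Resident set size of the process, to compare against the configured heap
  ps -o rss=,vsz= -p "$pid"

  # Cassandra's own view of heap and off-heap usage
  nodetool info | grep -i heap

  # The memtable knobs mentioned above
  grep -E 'memtable_heap_space_in_mb|memtable_offheap_space_in_mb|memtable_flush_writers' /etc/cassandra/cassandra.yaml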

On Fri, May 20, 2016 at 10:02 PM, Zhiyan Shao  wrote:

> Hi, we see the following OOM crash while doing heavy write loading
> testing. Has anybody seen this kind of crash? We are using G1GC with 32GB
> heap size out of 128GB system memory. Eclipse Memory Analyzer shows the
> following:
>
> One instance of *"org.apache.cassandra.db.ColumnFamilyStore"* loaded by 
> *"sun.misc.Launcher$AppClassLoader
> @ 0x8d800898"* occupies *697,236,904 (51.91%)* bytes. The memory is
> accumulated in one instance of
> *"java.util.concurrent.ConcurrentSkipListMap$HeadIndex"* loaded by *" class loader>"*.
>
> *Keywords*
>
> java.util.concurrent.ConcurrentSkipListMap$HeadIndex
>
> sun.misc.Launcher$AppClassLoader @ 0x8d800898
>
> org.apache.cassandra.db.ColumnFamilyStore
>
> Cassandra log:
>
>
> ERROR 00:23:24 JVM state determined to be unstable.  Exiting forcefully
> due to:
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[na:1.8.0_74]
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[na:1.8.0_74]
> at
> org.apache.cassandra.utils.memory.SlabAllocator.getRegion(SlabAllocator.java:
> 137) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:
> 97) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:
> 57) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.utils.memory.ContextAllocator.clone
> (ContextAllocator.java:47) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.utils.memory.MemtableBufferAllocator.clone
> (MemtableBufferAllocator.java:61) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:212)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:
> 1249) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:406)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:366)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Mutation.apply(Mutation.java:214)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:
> 50) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:
> 67) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_74]
> at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:
> 164) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:
> 136) [apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
> [apache-cassandra-2.2.5.jar:2.2.5]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
>
> Thanks,
> Zhiyan
>


Re: OOM when Adding host

2015-08-10 Thread rock zhang
My Cassandra version is 2.1.4.

Thanks
Rock 

On Aug 10, 2015, at 9:52 AM, rock zhang r...@alohar.com wrote:

 Hi All,
 
 Currently I have three hosts. The data is not balanced: one has 79GB and the 
 other two have 300GB. When I was adding a new host, I first got a "too many 
 open files" error, so I changed the file open limit from 100,000 to 1,000,000. 
 Then I got an OOM error.
 
 Should I change the limit to 20, instead of 1M?  My memory is 33GB, and I am 
 using EC2 c2*2xlarge.  Ideally, even if the data is large it should just be 
 slower, not OOM; I don't understand why.
 
 I actually get this error pretty often. I guess the reason is that my data is 
 pretty large?  If Cassandra tries to split the data evenly across all hosts, 
 then Cassandra needs to copy around 200GB to the new host. 
 
 From my experience, an alternative way to solve this is to add the new host as 
 a seed and not use "Add host"; then data would not be moved, so no OOM. But I 
 am not sure whether data will be lost or cannot be located. 
 
 Thanks
 Rock 
 



Re: OOM when Adding host

2015-08-10 Thread rock zhang
I logged the open files every 10 mins; the last record is:

lsof -p $cassandraPID | wc -l

74728

lsof | wc -l
5887913   # this is a very large number, don't know why.

After the OOM the open file numbers went back to a few hundred (lsof | wc -l).
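One possible explanation for the huge second number: a bare lsof can list the same descriptor once per task/thread, which inflates the count enormously on a JVM with hundreds of threads. Counting /proc entries, and checking the limit that actually applies to the running process, is usually more representative; a sketch (pgrep pattern is an assumption):

  pid=$(pgrep -f CassandraDaemon)
  ls /proc/$pid/fd | wc -l                 # descriptors actually held by the process
  grep 'Max open files' /proc/$pid/limits  # the limit the process is really running with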




On Aug 10, 2015, at 9:59 AM, rock zhang r...@alohar.com wrote:

 My Cassandra version is 2.1.4.
 
 Thanks
 Rock 
 
 On Aug 10, 2015, at 9:52 AM, rock zhang r...@alohar.com wrote:
 
 Hi All,
 
 Currently I have three hosts. The data is not balanced: one has 79GB and the 
 other two have 300GB. When I was adding a new host, I first got a "too many 
 open files" error, so I changed the file open limit from 100,000 to 
 1,000,000. Then I got an OOM error.
 
 Should I change the limit to 20, instead of 1M?  My memory is 33GB, and I am 
 using EC2 c2*2xlarge.  Ideally, even if the data is large it should just be 
 slower, not OOM; I don't understand why.
 
 I actually get this error pretty often. I guess the reason is that my data 
 is pretty large?  If Cassandra tries to split the data evenly across all 
 hosts, then Cassandra needs to copy around 200GB to the new host. 
 
 From my experience, an alternative way to solve this is to add the new host 
 as a seed and not use "Add host"; then data would not be moved, so no OOM. 
 But I am not sure whether data will be lost or cannot be located. 
 
 Thanks
 Rock 
 
 



Re: OOM and high SSTables count

2015-03-04 Thread daemeon reiydelle
Are you finding a correlation between the shards on the OOM DC1 nodes and
the OOM DC2 nodes? Does your monitoring tool indicate that the DC1 nodes
are using significantly more CPU (and memory) than the nodes that are NOT
failing? I am leading you down the path to suspect that your sharding is
giving you hot spots. Also are you using vnodes?

Patrick


 On Wed, Mar 4, 2015 at 9:26 AM, Jan cne...@yahoo.com wrote:

 HI Roni;

 You mentioned:
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB of
 RAM and 5GB HEAP.

 Best practices would be to:
 a)  have a consistent type of node across both DC's (CPUs, Memory, Heap &
 Disk)
 b)  increase heap on DC2 servers to be 8GB for C* Heap (a config sketch
 follows right after this message)

 The leveled compaction issue is not addressed by this.
 hope this helps

 Jan/
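A minimal sketch of point (b) above, assuming the stock cassandra-env.sh location; both variables must be set together when overriding the automatic sizing, and the new-gen value is just the common ~100MB-per-core rule of thumb:

  # /etc/cassandra/cassandra-env.sh
  MAX_HEAP_SIZE="8G"
  HEAP_NEWSIZE="800M"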




   On Wednesday, March 4, 2015 8:41 AM, Roni Balthazar 
 ronibaltha...@gmail.com wrote:


 Hi there,

 We are running C* 2.1.3 cluster with 2 DataCenters: DC1: 30 Servers /
 DC2 - 10 Servers.
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB
 of RAM and 5GB HEAP.
 DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.
 DC2 is used only for backup purposes. There are no reads on DC2.
 All writes and reads are on DC1 using LOCAL_ONE, with RF DC1: 2 and
 DC2: 1.
 All keyspaces have STCS (Average 20~30 SSTables count each table on
 both DCs) except one that is using LCS (DC1: Avg 4K~7K SSTables / DC2:
 Avg 3K~14K SSTables).

 Basically we are running into 2 problems:

 1) High SSTables count on keyspace using LCS (This KS has 500GB~600GB
 of data on each DC1 node).
 2) There are 2 servers on DC1 and 4 servers in DC2 that went down with
 the OOM error message below:

 ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394
 JVMStabilityInspector.java:94 - JVM state determined to be unstable.
 Exiting forcefully due to:
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 ~[guava-16.0.jar:na]
 at
 org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 ~[guava-16.0.jar:na]
 at
 org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:203)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:107)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:81)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:320)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 

Re: OOM and high SSTables count

2015-03-04 Thread Patrick McFadin
What kind of disks are you running here? Are you getting a lot of GC before
the OOM?

Patrick

On Wed, Mar 4, 2015 at 9:26 AM, Jan cne...@yahoo.com wrote:

 HI Roni;

 You mentioned:
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB of
 RAM and 5GB HEAP.

 Best practices would be to:
 a)  have a consistent type of node across both DC's (CPUs, Memory, Heap &
 Disk)
 b)  increase heap on DC2 servers to be 8GB for C* Heap

 The leveled compaction issue is not addressed by this.
 hope this helps

 Jan/




   On Wednesday, March 4, 2015 8:41 AM, Roni Balthazar 
 ronibaltha...@gmail.com wrote:


 Hi there,

 We are running C* 2.1.3 cluster with 2 DataCenters: DC1: 30 Servers /
 DC2 - 10 Servers.
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB
 of RAM and 5GB HEAP.
 DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.
 DC2 is used only for backup purposes. There are no reads on DC2.
 All writes and reads are on DC1 using LOCAL_ONE, with RF DC1: 2 and
 DC2: 1.
 All keyspaces have STCS (Average 20~30 SSTables count each table on
 both DCs) except one that is using LCS (DC1: Avg 4K~7K SSTables / DC2:
 Avg 3K~14K SSTables).

 Basically we are running into 2 problems:

 1) High SSTables count on keyspace using LCS (This KS has 500GB~600GB
 of data on each DC1 node).
 2) There are 2 servers on DC1 and 4 servers in DC2 that went down with
 the OOM error message below:

 ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394
 JVMStabilityInspector.java:94 - JVM state determined to be unstable.
 Exiting forcefully due to:
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 ~[guava-16.0.jar:na]
 at
 org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 ~[guava-16.0.jar:na]
 at
 org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:203)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:107)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:81)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:320)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:62)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1915)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 

Re: OOM and high SSTables count

2015-03-04 Thread graham sanderson
We can confirm a problem on 2.1.3 (sadly our beta sstable state obviously did 
not match our production ones in some critical way)

We have about 20k sstables on each of 6 nodes right now; actually a quick 
glance shows 15k of those are from OpsCenter, which may have something to do 
with beta/production mismatch

I will look into the open OOM JIRA issue against 2.1.3 - we may be being penalized 
for our heavy use of JBOD (x7 per node)

It also looks like 2.1.3 is leaking memory, though it eventually recovers via 
GCInspector causing a complete memtable flush.
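A quick way to see which tables those ~20k SSTables belong to (the default data directory and the 2.1 keyspace/table directory layout are assumptions):

  find /var/lib/cassandra/data -name '*-Data.db' \
    | awk -F/ '{print $(NF-2) "/" $(NF-1)}' \
    | sort | uniq -c | sort -rn | head -20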

 On Mar 4, 2015, at 12:31 PM, daemeon reiydelle daeme...@gmail.com wrote:
 
 Are you finding a correlation between the shards on the OOM DC1 nodes and the 
 OOM DC2 nodes? Does your monitoring tool indicate that the DC1 nodes are 
 using significantly more CPU (and memory) than the nodes that are NOT 
 failing? I am leading you down the path to suspect that your sharding is 
 giving you hot spots. Also are you using vnodes?
 
 Patrick
 
 On Wed, Mar 4, 2015 at 9:26 AM, Jan cne...@yahoo.com wrote:
 HI Roni; 
 
 You mentioned: 
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB of RAM 
 and 5GB HEAP.
 
 Best practices would be to:
 a)  have a consistent type of node across both DC's (CPUs, Memory, Heap & 
 Disk)
 b)  increase heap on DC2 servers to be 8GB for C* Heap 
 
 The leveled compaction issue is not addressed by this. 
 hope this helps
 
 Jan/
 
 
 
 
 On Wednesday, March 4, 2015 8:41 AM, Roni Balthazar ronibaltha...@gmail.com wrote:
 
 
 Hi there,
 
 We are running C* 2.1.3 cluster with 2 DataCenters: DC1: 30 Servers /
 DC2 - 10 Servers.
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB
 of RAM and 5GB HEAP.
 DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.
 DC2 is used only for backup purposes. There are no reads on DC2.
 All writes and reads are on DC1 using LOCAL_ONE, with RF DC1: 2 and DC2: 
 1.
 All keyspaces have STCS (Average 20~30 SSTables count each table on
 both DCs) except one that is using LCS (DC1: Avg 4K~7K SSTables / DC2:
 Avg 3K~14K SSTables).
 
 Basically we are running into 2 problems:
 
 1) High SSTables count on keyspace using LCS (This KS has 500GB~600GB
 of data on each DC1 node).
 2) There are 2 servers on DC1 and 4 servers in DC2 that went down with
 the OOM error message below:
 
 ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394
 JVMStabilityInspector.java:94 - JVM state determined to be unstable.
 Exiting forcefully due to:
 java.lang.OutOfMemoryError: Java heap space
 at 
 org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 ~[guava-16.0.jar:na]
 at 
 org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 

Re: OOM and high SSTables count

2015-03-04 Thread J. Ryan Earl
We think it is this bug:
https://issues.apache.org/jira/browse/CASSANDRA-8860

We're rolling a patch to beta before rolling it into production.

On Wed, Mar 4, 2015 at 4:12 PM, graham sanderson gra...@vast.com wrote:

 We can confirm a problem on 2.1.3 (sadly our beta sstable state obviously
 did not match our production ones in some critical way)

 We have about 20k sstables on each of 6 nodes right now; actually a quick
 glance shows 15k of those are from OpsCenter, which may have something to
 do with beta/production mismatch

 I will look into the open OOM JIRA issue against 2.1.3 - we may being
 penalized for heavy use of JBOD (x7 per node)

 It also looks like 2.1.3 is leaking memory, though it eventually recovers
 via GCInspector causing a complete memtable flush.

 On Mar 4, 2015, at 12:31 PM, daemeon reiydelle daeme...@gmail.com wrote:

 Are you finding a correlation between the shards on the OOM DC1 nodes and
 the OOM DC2 nodes? Does your monitoring tool indicate that the DC1 nodes
 are using significantly more CPU (and memory) than the nodes that are NOT
 failing? I am leading you down the path to suspect that your sharding is
 giving you hot spots. Also are you using vnodes?

 Patrick


 On Wed, Mar 4, 2015 at 9:26 AM, Jan cne...@yahoo.com wrote:

 HI Roni;

 You mentioned:
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB of
 RAM and 5GB HEAP.

 Best practices would be to:
 a)  have a consistent type of node across both DC's (CPUs, Memory,
 Heap & Disk)
 b)  increase heap on DC2 servers to be 8GB for C* Heap

 The leveled compaction issue is not addressed by this.
 hope this helps

 Jan/




   On Wednesday, March 4, 2015 8:41 AM, Roni Balthazar 
 ronibaltha...@gmail.com wrote:


 Hi there,

 We are running C* 2.1.3 cluster with 2 DataCenters: DC1: 30 Servers /
 DC2 - 10 Servers.
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB
 of RAM and 5GB HEAP.
 DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.
 DC2 is used only for backup purposes. There are no reads on DC2.
 All writes and reads are on DC1 using LOCAL_ONE, with RF DC1: 2 and
 DC2: 1.
 All keyspaces have STCS (Average 20~30 SSTables count each table on
 both DCs) except one that is using LCS (DC1: Avg 4K~7K SSTables / DC2:
 Avg 3K~14K SSTables).

 Basically we are running into 2 problems:

 1) High SSTables count on keyspace using LCS (This KS has 500GB~600GB
 of data on each DC1 node).
 2) There are 2 servers on DC1 and 4 servers in DC2 that went down with
 the OOM error message below:

 ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394
 JVMStabilityInspector.java:94 - JVM state determined to be unstable.
 Exiting forcefully due to:
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 ~[guava-16.0.jar:na]
 at
 org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at
 

Re: OOM at Bootstrap Time

2014-10-30 Thread Maxime
I will give it a shot and add the logging.

I've tried some experiments and I have no clue what could be happening
anymore:

I tried setting all nodes to a streamthroughput of 1 except 1, to see if
somehow it was getting overloaded by too many streams coming in at once,
nope.
I went through the source at ColumnFamilyStore.java:856 where the huge
burst of Enqueuing flush... occurs, and it's clearly at the moment
memtables get converted to SSTables on disk. So I started the bootstrap
process and, using a bash script, triggered a 'nodetool flush' every minute
during the process. At first it seemed to work, but again, after what
seems to be a locally-triggered cue, the burst came (many thousands of
Enqueuing flush...). But through my previous experiment, I am fairly
certain it's not a question of the volume of data coming in (throughput), or
the number of SSTables being streamed (dealing with at most 150 files per node).

Does anyone know if such Enqueuing bursts are normal during bootstrap? I'd
like to be able to say it's because my nodes are underpowered, but at the
moment, I'm leaning towards a bug of some kind.
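For anyone wanting to repeat those two experiments, a rough sketch of what is described above; the host names are placeholders, while the 1 MB/s throughput and the one-minute interval come from the description:

  # Throttle outbound streaming on every existing node while the new node joins
  for host in node1 node2 node3; do
      ssh "$host" nodetool setstreamthroughput 1
  done

  # On the joining node: force a memtable flush every minute during bootstrap
  while true; do nodetool flush; sleep 60; done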

On Wed, Oct 29, 2014 at 3:05 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Some ideas:

 1) Put on DEBUG log on the joining node to see what is going on in details
 with the stream with 1500 files

 2) Check the stream ID to see whether it's a new stream or an old one
 pending



 On Wed, Oct 29, 2014 at 2:21 AM, Maxime maxim...@gmail.com wrote:

 Doan, thanks for the tip, I just read about it this morning, just waiting
 for the new version to pop up on the debian datastax repo.

 Michael, I do believe you are correct in the general running of the
 cluster and I've reset everything.

 So it took me a while to reply, I finally got the SSTables down, as seen
 in the OpsCenter graphs. I'm stumped however because when I bootstrap the
 new node, I still see very large number of files being streamed (~1500 for
 some nodes) and the bootstrap process is failing exactly as it did before,
 in a flury of Enqueuing flush of ...

 Any ideas? I'm reaching the end of what I know I can do, OpsCenter says
 around 32 SStables per CF, but still streaming tons of files. :-/


 On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Tombstones will be a very important issue for me since the dataset is
 very much a rolling dataset using TTLs heavily.

 -- You can try the new DateTiered compaction strategy (
 https://issues.apache.org/jira/browse/CASSANDRA-6602) released on 2.1.1
 if you have a time series data model to eliminate tombstones
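A hedged sketch of what that switch looks like for one time-series table; the keyspace/table name and the age cutoff are placeholders, and the other DTCS options are left at their defaults:

  cqlsh -e "
  ALTER TABLE my_keyspace.my_timeseries
  WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'max_sstable_age_days': '10'
  };"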

 On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 Again, from our experience w 2.0.x:

 Revert to the defaults - you are manually setting heap way too high
 IMHO.

 On our small nodes we tried LCS - way too much compaction - switch all
 CFs to STCS.

 We do a major rolling compaction on our small nodes weekly during less
 busy hours - works great. Be sure you have enough disk.

 We never explicitly delete and only use ttls or truncation. You can set
 GC to 0 in that case, so tombstones are more readily expunged. There are a
 couple threads in the list that discuss this... also normal rolling repair
 becomes optional, reducing load (still repair if something unusual happens
 tho...).

 In your current situation, you need to kickstart compaction - are there
 any CFs you can truncate at least temporarily? Then try compacting a small
 CF, then another, etc.

 Hopefully you can get enough headroom to add a node.

 ml




 On Sun, Oct 26, 2014 at 6:24 PM, Maxime maxim...@gmail.com wrote:

 Hmm, thanks for the reading.

 I initially followed some (perhaps too old) maintenance scripts, which
 included weekly 'nodetool compact'. Is there a way for me to undo the
 damage? Tombstones will be a very important issue for me since the dataset
 is very much a rolling dataset using TTLs heavily.

 On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Should doing a major compaction on those nodes lead to a restructuration
 of the SSTables? -- Beware of the major compaction on SizeTiered, it 
 will
 create 2 giant SSTables and the expired/outdated/tombstone columns in 
 this
 big file will be never cleaned since the SSTable will never get a chance 
 to
 be compacted again

 Essentially to reduce the fragmentation of small SSTables you can
 stay with SizeTiered compaction and play around with compaction 
 properties
 (the thresholds) to make C* group a bunch of files each time it compacts 
 so
 that the file number shrinks to a reasonable count

 Since you're using C* 2.1 and anti-compaction has been introduced, I
 hesitate advising you to use Leveled compaction as a work-around to 
 reduce
 SSTable count.

  Things are a little bit more complicated because of the incremental
 repair process (I don't know whether you're using incremental repair or 
 not
 in production). The Dev blog says that Leveled compaction is performed 
 only
 on repaired 

Re: OOM at Bootstrap Time

2014-10-30 Thread Maxime
I've been trying to go through the logs but I can't say I understand very
well the details:

INFO  [SlabPoolCleaner] 2014-10-30 19:20:18,446 ColumnFamilyStore.java:856
- Enqueuing flush of loc: 7977119 (1%) on-heap, 0 (0%) off-heap
DEBUG [SharedPool-Worker-22] 2014-10-30 19:20:18,446
AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
2c95cbbb61fb8ec3bd06d70058bfa236ccad5195e48fd00c056f7e1e3fdd4368 in
ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815026000
!63072000,])
DEBUG [SharedPool-Worker-6] 2014-10-30 19:20:18,446
AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
41fc260427a88d2f084971702fdcb32756e0731c6042f93e9761e03db5197990 in
ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815333000
!63072000,])
DEBUG [SharedPool-Worker-25] 2014-10-30 19:20:18,446
AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
2e8c4dab33faade0a4fc265e4126e43dc2e58fb72830f73d7e9b8e836101d413 in
ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815335000
!63072000,])
DEBUG [SharedPool-Worker-26] 2014-10-30 19:20:18,446
AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
245bec68c5820364a72db093d5c9899b631e692006881c98f0abf4da5fbff4cd in
ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815344000
!63072000,])
DEBUG [SharedPool-Worker-20] 2014-10-30 19:20:18,446
AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
ea8dfb47177bd40f46aac4fe41d3cfea3316cf35451ace0825f46b6e0fa9e3ef in
ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815262000
!63072000,])

This is a sample of Enqueuing flush events in the storm.

On Thu, Oct 30, 2014 at 12:20 PM, Maxime maxim...@gmail.com wrote:

 I will give a shot adding the logging.

 I've tried some experiments and I have no clue what could be happening
 anymore:

 I tried setting streamthroughput to 1 on all nodes except one, to see if
 the joining node was somehow getting overloaded by too many streams coming
 in at once; it wasn't.
 I went through the source at ColumnFamilyStore.java:856, where the huge
 burst of Enqueuing flush... occurs, and it's clearly at the moment
 memtables get converted to SSTables on disk. So I started the bootstrap
 process and, using a bash script, triggered a 'nodetool flush' every minute
 during the process. At first it seemed to work, but again, after what
 seems to be a locally-triggered cue, the burst returns (many, many
 thousands of Enqueuing flush...). Through my previous experiment, I am
 fairly certain it's not a question of volume of data coming in (throughput)
 or of the number of SSTables being streamed (dealing with at most 150 files
 per node).

 Does anyone know if such Enqueuing bursts are normal during bootstrap? I'd
 like to be able to say it's because my nodes are underpowered, but at the
 moment, I'm leaning towards a bug of some kind.

 On Wed, Oct 29, 2014 at 3:05 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Some ideas:

 1) Turn on DEBUG logging on the joining node to see in detail what is
 going on with the stream of 1500 files

 2) Check the stream ID to see whether it's a new stream or an old one
 pending



 On Wed, Oct 29, 2014 at 2:21 AM, Maxime maxim...@gmail.com wrote:

 Doan, thanks for the tip, I just read about it this morning, just
 waiting for the new version to pop up on the debian datastax repo.

 Michael, I do believe you are correct in the general running of the
 cluster and I've reset everything.

 So it took me a while to reply, but I finally got the SSTables down, as seen
 in the OpsCenter graphs. I'm stumped, however, because when I bootstrap the
 new node, I still see a very large number of files being streamed (~1500 for
 some nodes) and the bootstrap process is failing exactly as it did before,
 in a flurry of Enqueuing flush of ...

 Any ideas? I'm reaching the end of what I know I can do. OpsCenter says
 around 32 SSTables per CF, but still streaming tons of files. :-/


 On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Tombstones will be a very important issue for me since the dataset is
 very much a rolling dataset using TTLs heavily.

 -- You can try the new DateTiered compaction strategy (
 https://issues.apache.org/jira/browse/CASSANDRA-6602) released on
 2.1.1 if you have a time series data model to eliminate tombstones

 On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 Again, from our experience w 2.0.x:

 Revert to the defaults - you are manually setting heap way too high
 IMHO.

 On our small nodes we tried LCS - way too much compaction - switch all
 CFs to STCS.

 We do a major rolling compaction on our small nodes weekly during less
 busy hours - works great. Be sure you have enough disk.

 We never explicitly delete and only use ttls or truncation. You can
 set GC to 0 in that case, so tombstones are more readily expunged. There
 are a couple threads in the list that discuss this... also normal rolling
 repair becomes 

Re: OOM at Bootstrap Time

2014-10-30 Thread Maxime
Well, the answer was secondary indexes. I am guessing they were corrupted
somehow. I dropped all of them, ran a cleanup, and now the nodes are
bootstrapping fine.
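
For reference, dropping a secondary index is a one-line CQL statement; the
sketch below is only illustrative, with loc_id_idx taken from the log lines
earlier in this thread and mykeyspace standing in for the real keyspace name:

    -- drop the suspect secondary index (keyspace name is a placeholder)
    DROP INDEX IF EXISTS mykeyspace.loc_id_idx;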

On Thu, Oct 30, 2014 at 3:50 PM, Maxime maxim...@gmail.com wrote:

 I've been trying to go through the logs, but I can't say I understand the
 details very well:

 INFO  [SlabPoolCleaner] 2014-10-30 19:20:18,446 ColumnFamilyStore.java:856
 - Enqueuing flush of loc: 7977119 (1%) on-heap, 0 (0%) off-heap
 DEBUG [SharedPool-Worker-22] 2014-10-30 19:20:18,446
 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
 2c95cbbb61fb8ec3bd06d70058bfa236ccad5195e48fd00c056f7e1e3fdd4368 in
 ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815026000
 !63072000,])
 DEBUG [SharedPool-Worker-6] 2014-10-30 19:20:18,446
 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
 41fc260427a88d2f084971702fdcb32756e0731c6042f93e9761e03db5197990 in
 ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815333000
 !63072000,])
 DEBUG [SharedPool-Worker-25] 2014-10-30 19:20:18,446
 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
 2e8c4dab33faade0a4fc265e4126e43dc2e58fb72830f73d7e9b8e836101d413 in
 ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815335000
 !63072000,])
 DEBUG [SharedPool-Worker-26] 2014-10-30 19:20:18,446
 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
 245bec68c5820364a72db093d5c9899b631e692006881c98f0abf4da5fbff4cd in
 ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815344000
 !63072000,])
 DEBUG [SharedPool-Worker-20] 2014-10-30 19:20:18,446
 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row
 ea8dfb47177bd40f46aac4fe41d3cfea3316cf35451ace0825f46b6e0fa9e3ef in
 ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815262000
 !63072000,])

 This is a sample of Enqueuing flush events in the storm.

 On Thu, Oct 30, 2014 at 12:20 PM, Maxime maxim...@gmail.com wrote:

 I will give it a shot and add the logging.

 I've tried some experiments and I have no clue what could be happening
 anymore:

 I tried setting streamthroughput to 1 on all nodes except one, to see if
 the joining node was somehow getting overloaded by too many streams coming
 in at once; it wasn't.
 I went through the source at ColumnFamilyStore.java:856, where the huge
 burst of Enqueuing flush... occurs, and it's clearly at the moment
 memtables get converted to SSTables on disk. So I started the bootstrap
 process and, using a bash script, triggered a 'nodetool flush' every minute
 during the process. At first it seemed to work, but again, after what
 seems to be a locally-triggered cue, the burst returns (many, many
 thousands of Enqueuing flush...). Through my previous experiment, I am
 fairly certain it's not a question of volume of data coming in (throughput)
 or of the number of SSTables being streamed (dealing with at most 150 files
 per node).

 Does anyone know if such Enqueuing bursts are normal during bootstrap?
 I'd like to be able to say it's because my nodes are underpowered, but at
 the moment, I'm leaning towards a bug of some kind.

 On Wed, Oct 29, 2014 at 3:05 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Some ideas:

 1) Turn on DEBUG logging on the joining node to see in detail what is
 going on with the stream of 1500 files

 2) Check the stream ID to see whether it's a new stream or an old one
 pending



 On Wed, Oct 29, 2014 at 2:21 AM, Maxime maxim...@gmail.com wrote:

 Doan, thanks for the tip, I just read about it this morning, just
 waiting for the new version to pop up on the debian datastax repo.

 Michael, I do believe you are correct in the general running of the
 cluster and I've reset everything.

 So it took me a while to reply, but I finally got the SSTables down, as
 seen in the OpsCenter graphs. I'm stumped, however, because when I bootstrap
 the new node, I still see a very large number of files being streamed (~1500
 for some nodes) and the bootstrap process is failing exactly as it did
 before, in a flurry of Enqueuing flush of ...

 Any ideas? I'm reaching the end of what I know I can do. OpsCenter says
 around 32 SSTables per CF, but still streaming tons of files. :-/


 On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Tombstones will be a very important issue for me since the dataset
 is very much a rolling dataset using TTLs heavily.

 -- You can try the new DateTiered compaction strategy (
 https://issues.apache.org/jira/browse/CASSANDRA-6602) released on
 2.1.1 if you have a time series data model to eliminate tombstones

 On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 Again, from our experience w 2.0.x:

 Revert to the defaults - you are manually setting heap way too high
 IMHO.

 On our small nodes we tried LCS - way too much compaction - switch
 all CFs to STCS.

 We do a major rolling compaction on our small nodes weekly during
 less busy hours - works great. Be sure you 

Re: OOM at Bootstrap Time

2014-10-29 Thread DuyHai Doan
Some ideas:

1) Turn on DEBUG logging on the joining node to see in detail what is going on
with the stream of 1500 files (see the sketch below)

2) Check the stream ID to see whether it's a new stream or an old one
pending
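
As a rough sketch of point 1 (assuming C* 2.1, where nodetool can change log
levels at runtime), the streaming package can be put in DEBUG without a
restart:

    # raise streaming verbosity on the joining node (hedged sketch)
    nodetool setlogginglevel org.apache.cassandra.streaming DEBUG
    # put it back when done
    nodetool setlogginglevel org.apache.cassandra.streaming INFO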



On Wed, Oct 29, 2014 at 2:21 AM, Maxime maxim...@gmail.com wrote:

 Doan, thanks for the tip, I just read about it this morning, just waiting
 for the new version to pop up on the debian datastax repo.

 Michael, I do believe you are correct in the general running of the
 cluster and I've reset everything.

 So it took me a while to reply, but I finally got the SSTables down, as seen
 in the OpsCenter graphs. I'm stumped, however, because when I bootstrap the
 new node, I still see a very large number of files being streamed (~1500 for
 some nodes) and the bootstrap process is failing exactly as it did before,
 in a flurry of Enqueuing flush of ...

 Any ideas? I'm reaching the end of what I know I can do. OpsCenter says
 around 32 SSTables per CF, but still streaming tons of files. :-/


 On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Tombstones will be a very important issue for me since the dataset is
 very much a rolling dataset using TTLs heavily.

 -- You can try the new DateTiered compaction strategy (
 https://issues.apache.org/jira/browse/CASSANDRA-6602) released on 2.1.1
 if you have a time series data model to eliminate tombstones

 On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 Again, from our experience w 2.0.x:

 Revert to the defaults - you are manually setting heap way too high IMHO.

 On our small nodes we tried LCS - way too much compaction - switch all
 CFs to STCS.

 We do a major rolling compaction on our small nodes weekly during less
 busy hours - works great. Be sure you have enough disk.

 We never explicitly delete and only use ttls or truncation. You can set
 GC to 0 in that case, so tombstones are more readily expunged. There are a
 couple threads in the list that discuss this... also normal rolling repair
 becomes optional, reducing load (still repair if something unusual happens
 tho...).

 In your current situation, you need to kickstart compaction - are there
 any CFs you can truncate at least temporarily? Then try compacting a small
 CF, then another, etc.

 Hopefully you can get enough headroom to add a node.

 ml




 On Sun, Oct 26, 2014 at 6:24 PM, Maxime maxim...@gmail.com wrote:

 Hmm, thanks for the reading.

 I initially followed some (perhaps too old) maintenance scripts, which
 included weekly 'nodetool compact'. Is there a way for me to undo the
 damage? Tombstones will be a very important issue for me since the dataset
 is very much a rolling dataset using TTLs heavily.

 On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Should doing a major compaction on those nodes lead to a restructuring
 of the SSTables? -- Beware of a major compaction on SizeTiered: it will
 create 2 giant SSTables, and the expired/outdated/tombstone columns in this
 big file will never be cleaned, since the SSTable will never get a chance to
 be compacted again

 Essentially to reduce the fragmentation of small SSTables you can stay
 with SizeTiered compaction and play around with compaction properties (the
 thresholds) to make C* group a bunch of files each time it compacts so 
 that
 the file number shrinks to a reasonable count

 Since you're using C* 2.1 and anti-compaction has been introduced, I
 hesitate advising you to use Leveled compaction as a work-around to reduce
 SSTable count.

  Things are a little bit more complicated because of the incremental
 repair process (I don't know whether you're using incremental repair or 
 not
 in production). The Dev blog says that Leveled compaction is performed 
 only
 on repaired SSTables, the un-repaired ones still use SizeTiered, more
 details here:
 http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1

 Regards





 On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 If the issue is related to I/O, you're going to want to determine if
 you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
 (queue size) and svctm (service time). The higher those numbers
 are, the more overwhelmed your disk is.

 On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com
 wrote:
  Hello Maxime
 
  Increasing the flush writers won't help if your disk I/O is not
 keeping up.
 
  I've had a look into the log file, below are some remarks:
 
  1) There are a lot of SSTables on disk for some tables (events for
 example,
  but not only). I've seen that some compactions are taking up to 32
 SSTables
  (which corresponds to the default max value for SizeTiered
 compaction).
 
  2) There is a secondary index that I found suspicious :
 loc.loc_id_idx. As
  its name implies I have the impression that it's an index on the id
 of the
  loc which would lead to almost an 1-1 relationship between the
 indexed value
  and 

Re: OOM at Bootstrap Time

2014-10-28 Thread Maxime
Doan, thanks for the tip, I just read about it this morning, just waiting
for the new version to pop up on the debian datastax repo.

Michael, I do believe you are correct in the general running of the cluster
and I've reset everything.

So it took me a while to reply, but I finally got the SSTables down, as seen in
the OpsCenter graphs. I'm stumped, however, because when I bootstrap the new
node, I still see a very large number of files being streamed (~1500 for some
nodes) and the bootstrap process is failing exactly as it did before, in a
flurry of Enqueuing flush of ...

Any ideas? I'm reaching the end of what I know I can do. OpsCenter says
around 32 SSTables per CF, but still streaming tons of files. :-/


On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Tombstones will be a very important issue for me since the dataset is
 very much a rolling dataset using TTLs heavily.

 -- You can try the new DateTiered compaction strategy (
 https://issues.apache.org/jira/browse/CASSANDRA-6602) released on 2.1.1
 if you have a time series data model to eliminate tombstones

 On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael michael.la...@nytimes.com
  wrote:

 Again, from our experience w 2.0.x:

 Revert to the defaults - you are manually setting heap way too high IMHO.

 On our small nodes we tried LCS - way too much compaction - switch all
 CFs to STCS.

 We do a major rolling compaction on our small nodes weekly during less
 busy hours - works great. Be sure you have enough disk.

 We never explicitly delete and only use ttls or truncation. You can set
 GC to 0 in that case, so tombstones are more readily expunged. There are a
 couple threads in the list that discuss this... also normal rolling repair
 becomes optional, reducing load (still repair if something unusual happens
 tho...).

 In your current situation, you need to kickstart compaction - are there
 any CFs you can truncate at least temporarily? Then try compacting a small
 CF, then another, etc.

 Hopefully you can get enough headroom to add a node.

 ml




 On Sun, Oct 26, 2014 at 6:24 PM, Maxime maxim...@gmail.com wrote:

 Hmm, thanks for the reading.

 I initially followed some (perhaps too old) maintenance scripts, which
 included weekly 'nodetool compact'. Is there a way for me to undo the
 damage? Tombstones will be a very important issue for me since the dataset
 is very much a rolling dataset using TTLs heavily.

 On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Should doing a major compaction on those nodes lead to a restructuring
 of the SSTables? -- Beware of a major compaction on SizeTiered: it will
 create 2 giant SSTables, and the expired/outdated/tombstone columns in this
 big file will never be cleaned, since the SSTable will never get a chance to
 be compacted again

 Essentially to reduce the fragmentation of small SSTables you can stay
 with SizeTiered compaction and play around with compaction properties (the
 thresholds) to make C* group a bunch of files each time it compacts so that
 the file number shrinks to a reasonable count

 Since you're using C* 2.1 and anti-compaction has been introduced, I
 hesitate advising you to use Leveled compaction as a work-around to reduce
 SSTable count.

  Things are a little bit more complicated because of the incremental
 repair process (I don't know whether you're using incremental repair or not
 in production). The Dev blog says that Leveled compaction is performed only
 on repaired SSTables, the un-repaired ones still use SizeTiered, more
 details here:
 http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1

 Regards





 On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 If the issue is related to I/O, you're going to want to determine if
 you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
 (queue size) and svctm (service time). The higher those numbers
 are, the more overwhelmed your disk is.

 On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com
 wrote:
  Hello Maxime
 
  Increasing the flush writers won't help if your disk I/O is not
 keeping up.
 
  I've had a look into the log file, below are some remarks:
 
  1) There are a lot of SSTables on disk for some tables (events for
 example,
  but not only). I've seen that some compactions are taking up to 32
 SSTables
  (which corresponds to the default max value for SizeTiered
 compaction).
 
  2) There is a secondary index that I found suspicious :
 loc.loc_id_idx. As
  its name implies I have the impression that it's an index on the id
 of the
  loc, which would lead to almost a 1-1 relationship between the indexed
  value and the original loc. Such an index should be avoided because it does
  not perform well. If it's not an index on the loc_id, please disregard my
  remark
 
  3) There is a clear imbalance of SSTable count on some nodes. In the
 log, I
  saw:
 
  INFO  [STREAM-IN-/...20] 2014-10-25 

Re: OOM at Bootstrap Time

2014-10-27 Thread Laing, Michael
Again, from our experience w 2.0.x:

Revert to the defaults - you are manually setting heap way too high IMHO.

On our small nodes we tried LCS - way too much compaction - so we switched all
CFs to STCS.

We do a major rolling compaction on our small nodes weekly during less busy
hours - works great. Be sure you have enough disk.

We never explicitly delete and only use ttls or truncation. You can set GC
to 0 in that case, so tombstones are more readily expunged. There are a
couple threads in the list that discuss this... also normal rolling repair
becomes optional, reducing load (still repair if something unusual happens
tho...).
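
As a minimal sketch of what "set GC to 0" means in CQL, assuming a TTL-only
table named events (the table name is only an illustration):

    -- only safe when data is never explicitly deleted (TTL/truncate only),
    -- otherwise deleted data can resurrect
    ALTER TABLE events WITH gc_grace_seconds = 0;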

In your current situation, you need to kickstart compaction - are there any
CFs you can truncate at least temporarily? Then try compacting a small CF,
then another, etc.

Hopefully you can get enough headroom to add a node.

ml




On Sun, Oct 26, 2014 at 6:24 PM, Maxime maxim...@gmail.com wrote:

 Hmm, thanks for the reading.

 I initially followed some (perhaps too old) maintenance scripts, which
 included weekly 'nodetool compact'. Is there a way for me to undo the
 damage? Tombstones will be a very important issue for me since the dataset
 is very much a rolling dataset using TTLs heavily.

 On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Should doing a major compaction on those nodes lead to a restructuring
 of the SSTables? -- Beware of a major compaction on SizeTiered: it will
 create 2 giant SSTables, and the expired/outdated/tombstone columns in this
 big file will never be cleaned, since the SSTable will never get a chance to
 be compacted again

 Essentially to reduce the fragmentation of small SSTables you can stay
 with SizeTiered compaction and play around with compaction properties (the
 thresholds) to make C* group a bunch of files each time it compacts so that
 the file number shrinks to a reasonable count

 Since you're using C* 2.1 and anti-compaction has been introduced, I
 hesitate advising you to use Leveled compaction as a work-around to reduce
 SSTable count.

  Things are a little bit more complicated because of the incremental
 repair process (I don't know whether you're using incremental repair or not
 in production). The Dev blog says that Leveled compaction is performed only
 on repaired SSTables, the un-repaired ones still use SizeTiered, more
 details here:
 http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1

 Regards





 On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 If the issue is related to I/O, you're going to want to determine if
 you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
 (queue size) and svctm (service time). The higher those numbers
 are, the more overwhelmed your disk is.

 On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com
 wrote:
  Hello Maxime
 
  Increasing the flush writers won't help if your disk I/O is not
 keeping up.
 
  I've had a look into the log file, below are some remarks:
 
  1) There are a lot of SSTables on disk for some tables (events for
 example,
  but not only). I've seen that some compactions are taking up to 32
 SSTables
  (which corresponds to the default max value for SizeTiered compaction).
 
  2) There is a secondary index that I found suspicious :
 loc.loc_id_idx. As
  its name implies I have the impression that it's an index on the id of
 the
  loc, which would lead to almost a 1-1 relationship between the indexed
  value and the original loc. Such an index should be avoided because it does
  not perform well. If it's not an index on the loc_id, please disregard my
  remark
 
  3) There is a clear imbalance of SSTable count on some nodes. In the
 log, I
  saw:
 
  INFO  [STREAM-IN-/...20] 2014-10-25 02:21:43,360
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...81] 2014-10-25 02:21:46,121
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...71] 2014-10-25 02:21:50,494
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes),
 sending
  0 files(0 bytes)
 
  INFO  [STREAM-IN-/...217] 2014-10-25 02:21:51,036
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes),
 sending
  0 files(0 bytes)
 
   As you can see, the existing 4 nodes are streaming data to the new
 node and
  on average the data set size is about 3.3 - 4.5 Gb. However the number
 of
  SSTables is around 150 files for nodes ...20 and
  ...81 but goes through the roof to reach 1315 files for
  

Re: OOM at Bootstrap Time

2014-10-27 Thread DuyHai Doan
Tombstones will be a very important issue for me since the dataset is very
much a rolling dataset using TTLs heavily.

-- You can try the new DateTiered compaction strategy (
https://issues.apache.org/jira/browse/CASSANDRA-6602), released in 2.1.1, if
you have a time-series data model, to eliminate tombstones
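
A minimal sketch of what that switch looks like, assuming C* 2.1.1+ and a
time-series table named events; the option values are placeholders, not
recommendations:

    ALTER TABLE events WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'base_time_seconds': '3600',
      'max_sstable_age_days': '10'
    };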

On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael michael.la...@nytimes.com
wrote:

 Again, from our experience w 2.0.x:

 Revert to the defaults - you are manually setting heap way too high IMHO.

 On our small nodes we tried LCS - way too much compaction - switch all CFs
 to STCS.

 We do a major rolling compaction on our small nodes weekly during less
 busy hours - works great. Be sure you have enough disk.

 We never explicitly delete and only use ttls or truncation. You can set GC
 to 0 in that case, so tombstones are more readily expunged. There are a
 couple threads in the list that discuss this... also normal rolling repair
 becomes optional, reducing load (still repair if something unusual happens
 tho...).

 In your current situation, you need to kickstart compaction - are there
 any CFs you can truncate at least temporarily? Then try compacting a small
 CF, then another, etc.

 Hopefully you can get enough headroom to add a node.

 ml




 On Sun, Oct 26, 2014 at 6:24 PM, Maxime maxim...@gmail.com wrote:

 Hmm, thanks for the reading.

 I initially followed some (perhaps too old) maintenance scripts, which
 included weekly 'nodetool compact'. Is there a way for me to undo the
 damage? Tombstones will be a very important issue for me since the dataset
 is very much a rolling dataset using TTLs heavily.

 On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Should doing a major compaction on those nodes lead to a restructuring
 of the SSTables? -- Beware of a major compaction on SizeTiered: it will
 create 2 giant SSTables, and the expired/outdated/tombstone columns in this
 big file will never be cleaned, since the SSTable will never get a chance to
 be compacted again

 Essentially to reduce the fragmentation of small SSTables you can stay
 with SizeTiered compaction and play around with compaction properties (the
 thresholds) to make C* group a bunch of files each time it compacts so that
 the file number shrinks to a reasonable count

 Since you're using C* 2.1 and anti-compaction has been introduced, I
 hesitate advising you to use Leveled compaction as a work-around to reduce
 SSTable count.

  Things are a little bit more complicated because of the incremental
 repair process (I don't know whether you're using incremental repair or not
 in production). The Dev blog says that Leveled compaction is performed only
 on repaired SSTables, the un-repaired ones still use SizeTiered, more
 details here:
 http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1

 Regards





 On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 If the issue is related to I/O, you're going to want to determine if
 you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
 (queue size) and svctm (service time). The higher those numbers
 are, the more overwhelmed your disk is.

 On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com
 wrote:
  Hello Maxime
 
  Increasing the flush writers won't help if your disk I/O is not
 keeping up.
 
  I've had a look into the log file, below are some remarks:
 
  1) There are a lot of SSTables on disk for some tables (events for
 example,
  but not only). I've seen that some compactions are taking up to 32
 SSTables
  (which corresponds to the default max value for SizeTiered
 compaction).
 
  2) There is a secondary index that I found suspicious :
 loc.loc_id_idx. As
  its name implies I have the impression that it's an index on the id
 of the
  loc, which would lead to almost a 1-1 relationship between the indexed
  value and the original loc. Such an index should be avoided because it does
  not perform well. If it's not an index on the loc_id, please disregard my
  remark
 
  3) There is a clear imbalance of SSTable count on some nodes. In the
 log, I
  saw:
 
  INFO  [STREAM-IN-/...20] 2014-10-25 02:21:43,360
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...81] 2014-10-25 02:21:46,121
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...71] 2014-10-25 02:21:50,494
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes),
 sending
  0 files(0 bytes)
 
  INFO  [STREAM-IN-/...217] 2014-10-25 02:21:51,036
  StreamResultFuture.java:166 - [Stream
 

Re: OOM at Bootstrap Time

2014-10-26 Thread DuyHai Doan
Hello Maxime

 Can you put the complete logs and config somewhere ? It would be
interesting to know what is the cause of the OOM.

On Sun, Oct 26, 2014 at 3:15 AM, Maxime maxim...@gmail.com wrote:

 Thanks a lot, that is comforting. We are also small at the moment, so I can
 definitely relate to the idea of keeping things small and simple, at a level
 where it just works.

 I see the new Apache version has a lot of fixes so I will try to upgrade
 before I look into downgrading.


 On Saturday, October 25, 2014, Laing, Michael michael.la...@nytimes.com
 wrote:

 Since no one else has stepped in...

 We have run clusters with ridiculously small nodes - I have a production
 cluster in AWS with 4GB nodes each with 1 CPU and disk-based instance
 storage. It works fine but you can see those little puppies struggle...

 And I ran into problems such as you observe...

 Upgrading Java to the latest 1.7 and - most importantly - *reverting to
 the default configuration, esp. for heap*, seemed to settle things down
 completely. Also make sure that you are using the 'recommended production
 settings' from the docs on your boxen.

 However we are running 2.0.x not 2.1.0 so YMMV.

 And we are switching to 15GB nodes w 2 heftier CPUs each and SSD storage
 - still a 'small' machine, but much more reasonable for C*.

 However I can't say I am an expert, since I deliberately keep things so
 simple that we do not encounter problems - it just works so I dig into
 other stuff.

 ml


 On Sat, Oct 25, 2014 at 5:22 PM, Maxime maxim...@gmail.com wrote:

 Hello, I've been trying to add a new node to my cluster ( 4 nodes ) for
 a few days now.

 I started by adding a node similar to my current configuration, 4 GB of
 RAM + 2 cores on DigitalOcean. However, every time, I would end up getting
 OOM errors after many log entries of the type:

 INFO  [SlabPoolCleaner] 2014-10-25 13:44:57,240
 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0
 (0%) off-heap

 leading to:

 ka-120-Data.db (39291 bytes) for commitlog position
 ReplayPosition(segmentId=1414243978538, position=23699418)
 WARN  [SharedPool-Worker-13] 2014-10-25 13:48:18,032
 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread
 Thread[SharedPool-Worker-13,5,main]: {}
 java.lang.OutOfMemoryError: Java heap space

 Thinking it had to do with either compaction somehow or streaming (2
 activities I've had tremendous issues with in the past), I tried to slow
 down setstreamthroughput to extremely low values, all the way to 5. I
 also tried setting setcompactionthroughput to 0, and then, after reading that
 in some cases it might be too fast, down to 8. Nothing worked; it merely
 vaguely changed the mean time to OOM, but not in a way indicating either was
 anywhere near a solution.
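
 For reference, those throttles are the plain nodetool knobs below (the values
 are just the ones I tried; 0 disables the throttle entirely):

     nodetool setstreamthroughput 5        # streaming cap, in megabits/s
     nodetool setcompactionthroughput 8    # compaction cap, in MB/s
     nodetool getstreamthroughput          # check the current values
     nodetool getcompactionthroughput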

 The nodes were configured with 2 GB of Heap initially, I tried to crank
 it up to 3 GB, stressing the host memory to its limit.

 After doing some exploration (I am considering writing a Cassandra Ops
 documentation with lessons learned since there seems to be little of it in
 organized fashions), I read that some people had strange issues on
 lower-end boxes like that, so I bit the bullet and upgraded my new node to
 a 8GB + 4 Core instance, which was anecdotally better.

 To my complete shock, exact same issues are present, even raising the
 Heap memory to 6 GB. I figure it can't be a normal situation anymore, but
 must be a bug somehow.

 My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes.
 About 10 CF of varying sizes. Runtime writes are between 300 to 900 /
 second. Cassandra 2.1.0, nothing too wild.

 Has anyone encountered these kinds of issues before? I would really
 enjoy hearing about the experiences of people trying to run small-sized
 clusters like mine. From everything I read, Cassandra operations go very
 well on large (16 GB + 8 Cores) machines, but I'm sad to report I've had
 nothing but trouble trying to run on smaller machines, perhaps I can learn
 from other's experience?

 Full logs can be provided to anyone interested.

 Cheers





Re: OOM at Bootstrap Time

2014-10-26 Thread Maxime
I've emailed you a raw log file of an instance of this happening.

I've been monitoring more closely the timing of events in tpstats and the
logs and I believe this is what is happening:

- For some reason, C* decides to provoke a flush storm (I say some reason,
I'm sure there is one but I have had difficulty determining the behaviour
changes between 1.* and more recent releases).
- So we see ~ 3000 flush being enqueued.
- This happens so suddenly that even boosting the number of flush writers
to 20 does not suffice. I don't even see all time blocked numbers for it
before C* stops responding. I suspect this is due to the sudden OOM and GC
occurring.
- The last tpstats output that comes back before the node goes down indicates
20 active and 3000 pending, with the rest at 0. It's by far the most anomalous
activity.

Is there a way to throttle down this generation of Flush? C* complains if I
set the queue_size to any value (deprecated now?) and boosting the threads
does not seem to help since even at 20 we're an order of magnitude off.

Suggestions? Comments?
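
If it helps frame the question, these are the cassandra.yaml knobs that appear
to bound memtable flushing in 2.1 (a hedged sketch only; the values are
illustrative, not recommendations, and the old 1.x memtable_flush_queue_size
setting is presumably the deprecated one C* complains about):

    memtable_flush_writers: 2          # parallel flush threads
    memtable_heap_space_in_mb: 512     # on-heap memtable budget
    memtable_offheap_space_in_mb: 512  # off-heap memtable budget
    memtable_cleanup_threshold: 0.11   # dirty fraction that triggers a flush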


On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Maxime

  Can you put the complete logs and config somewhere ? It would be
 interesting to know what is the cause of the OOM.

 On Sun, Oct 26, 2014 at 3:15 AM, Maxime maxim...@gmail.com wrote:

 Thanks a lot that is comforting. We are also small at the moment so I
 definitely can relate with the idea of keeping small and simple at a level
 where it just works.

 I see the new Apache version has a lot of fixes so I will try to upgrade
 before I look into downgrading.


 On Saturday, October 25, 2014, Laing, Michael michael.la...@nytimes.com
 wrote:

 Since no one else has stepped in...

 We have run clusters with ridiculously small nodes - I have a production
 cluster in AWS with 4GB nodes each with 1 CPU and disk-based instance
 storage. It works fine but you can see those little puppies struggle...

 And I ran into problems such as you observe...

 Upgrading Java to the latest 1.7 and - most importantly - *reverting to
 the default configuration, esp. for heap*, seemed to settle things down
 completely. Also make sure that you are using the 'recommended production
 settings' from the docs on your boxen.

 However we are running 2.0.x not 2.1.0 so YMMV.

 And we are switching to 15GB nodes w 2 heftier CPUs each and SSD storage
 - still a 'small' machine, but much more reasonable for C*.

 However I can't say I am an expert, since I deliberately keep things so
 simple that we do not encounter problems - it just works so I dig into
 other stuff.

 ml


 On Sat, Oct 25, 2014 at 5:22 PM, Maxime maxim...@gmail.com wrote:

 Hello, I've been trying to add a new node to my cluster ( 4 nodes ) for
 a few days now.

 I started by adding a node similar to my current configuration, 4 GB of
 RAM + 2 cores on DigitalOcean. However, every time, I would end up getting
 OOM errors after many log entries of the type:

 INFO  [SlabPoolCleaner] 2014-10-25 13:44:57,240
 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0
 (0%) off-heap

 leading to:

 ka-120-Data.db (39291 bytes) for commitlog position
 ReplayPosition(segmentId=1414243978538, position=23699418)
 WARN  [SharedPool-Worker-13] 2014-10-25 13:48:18,032
 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread
 Thread[SharedPool-Worker-13,5,main]: {}
 java.lang.OutOfMemoryError: Java heap space

 Thinking it had to do with either compaction somehow or streaming (2
 activities I've had tremendous issues with in the past), I tried to slow
 down setstreamthroughput to extremely low values, all the way to 5. I
 also tried setting setcompactionthroughput to 0, and then, after reading that
 in some cases it might be too fast, down to 8. Nothing worked; it merely
 vaguely changed the mean time to OOM, but not in a way indicating either was
 anywhere near a solution.

 The nodes were configured with 2 GB of Heap initially, I tried to crank
 it up to 3 GB, stressing the host memory to its limit.

 After doing some exploration (I am considering writing a Cassandra Ops
 documentation with lessons learned since there seems to be little of it in
 organized fashions), I read that some people had strange issues on
 lower-end boxes like that, so I bit the bullet and upgraded my new node to
 a 8GB + 4 Core instance, which was anecdotally better.

 To my complete shock, exact same issues are present, even raising the
 Heap memory to 6 GB. I figure it can't be a normal situation anymore, but
 must be a bug somehow.

 My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes.
 About 10 CF of varying sizes. Runtime writes are between 300 to 900 /
 second. Cassandra 2.1.0, nothing too wild.

 Has anyone encountered these kinds of issues before? I would really
 enjoy hearing about the experiences of people trying to run small-sized
 clusters like mine. From everything I read, Cassandra operations go very
 well on large (16 GB + 8 

Re: OOM at Bootstrap Time

2014-10-26 Thread DuyHai Doan
Hello Maxime

Increasing the flush writers won't help if your disk I/O is not keeping up.

I've had a look into the log file, below are some remarks:

1) There are a lot of SSTables on disk for some tables (events for example,
but not only). I've seen that some compactions are taking up to 32 SSTables
(which corresponds to the default max value for SizeTiered compaction).

2) There is a secondary index that I found suspicious: loc.loc_id_idx. As
its name implies, I have the impression that it's an index on the id of the
loc, which would lead to almost a 1-1 relationship between the indexed
value and the original loc. Such an index should be avoided because it does
not perform well. If it's not an index on the loc_id, please disregard my
remark

3) There is a clear imbalance of SSTable count on some nodes. In the log, I
saw:

INFO  [STREAM-IN-/...20] 2014-10-25 02:21:43,360
StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
ID#0] Prepare completed. Receiving *163* files(*4 111 187 195* bytes),
sending 0 files(0 bytes)

INFO  [STREAM-IN-/...81] 2014-10-25 02:21:46,121
StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
ID#0] Prepare completed. Receiving *154* files(*3 332 779 920* bytes),
sending 0 files(0 bytes)

INFO  [STREAM-IN-/...71] 2014-10-25 02:21:50,494
StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
ID#0] Prepare completed. Receiving *1315* files(*4 606 316 933* bytes),
sending 0 files(0 bytes)

INFO  [STREAM-IN-/...217] 2014-10-25 02:21:51,036
StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
ID#0] Prepare completed. Receiving *1640* files(*3 208 023 573* bytes),
sending 0 files(0 bytes)

 As you can see, the existing 4 nodes are streaming data to the new node
and on average the data set size is about 3.3 - 4.5 Gb. However the number
of SSTables is around 150 files for nodes ...20 and
...81 but goes through the roof to reach *1315* files for
...71 and *1640* files for ...217

 The total data set size is roughly the same but the file number is x10,
which means that you'll have a bunch of tiny files.

 I guess that upon reception of those files, there will be a massive flush
to disk, explaining the behaviour you're facing (flush storm)

I would suggest looking on nodes ...71 and ...217
to check for the total SSTable count for each table to confirm this
intuition
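
A quick, hedged way to check that count (the keyspace name is a placeholder;
in 2.1 the command is still called cfstats):

    nodetool cfstats mykeyspace | grep -E 'Table|SSTable count'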

Regards


On Sun, Oct 26, 2014 at 4:58 PM, Maxime maxim...@gmail.com wrote:

 I've emailed you a raw log file of an instance of this happening.

 I've been monitoring more closely the timing of events in tpstats and the
 logs and I believe this is what is happening:

 - For some reason, C* decides to provoke a flush storm (I say some reason,
 I'm sure there is one but I have had difficulty determining the behaviour
 changes between 1.* and more recent releases).
 - So we see ~ 3000 flush being enqueued.
 - This happens so suddenly that even boosting the number of flush writers
 to 20 does not suffice. I don't even see all time blocked numbers for it
 before C* stops responding. I suspect this is due to the sudden OOM and GC
 occurring.
 - The last tpstat that comes back before the node goes down indicates 20
 active and 3000 pending and the rest 0. It's by far the anomalous activity.

 Is there a way to throttle down this generation of Flush? C* complains if
 I set the queue_size to any value (deprecated now?) and boosting the
 threads does not seem to help since even at 20 we're an order of magnitude
 off.

 Suggestions? Comments?


 On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Maxime

  Can you put the complete logs and config somewhere ? It would be
 interesting to know what is the cause of the OOM.

 On Sun, Oct 26, 2014 at 3:15 AM, Maxime maxim...@gmail.com wrote:

 Thanks a lot that is comforting. We are also small at the moment so I
 definitely can relate with the idea of keeping small and simple at a level
 where it just works.

 I see the new Apache version has a lot of fixes so I will try to
 upgrade before I look into downgrading.


 On Saturday, October 25, 2014, Laing, Michael michael.la...@nytimes.com
 wrote:

 Since no one else has stepped in...

 We have run clusters with ridiculously small nodes - I have a
 production cluster in AWS with 4GB nodes each with 1 CPU and disk-based
 instance storage. It works fine but you can see those little puppies
 struggle...

 And I ran into problems such as you observe...

 Upgrading Java to the latest 1.7 and - most importantly - *reverting
 to the default configuration, esp. for heap*, seemed to settle things
 down completely. Also make sure that you are using the 'recommended
 production settings' from the docs on your boxen.

 However we are running 2.0.x not 2.1.0 so YMMV.

 And we are switching to 15GB nodes w 2 heftier CPUs each 

Re: OOM at Bootstrap Time

2014-10-26 Thread Maxime
Thank you very much for your reply. This is a deeper interpretation of the
logs than I can do at the moment.

Regarding 2), it's a good assumption on your part, but in this case,
non-obviously, the loc table's primary key is actually not id; the schema
changed historically, which has led to this odd naming of the field.

What you are describing makes me think it may be related to an odd state
left behind by an experiment I made a few days ago. I switched all tables
from SizeTiered to the Leveled compaction strategy (in an attempt to make
better use of the limited disk space on the machines; compaction was starting
to lead to nodes running out of space). Afterwards I reverted a few of the
more write-heavy tables to SizeTiered. The whole experiment seemed shaky.

Should doing a major compaction on those nodes lead to a restructuring of
the SSTables? I would think so.

On Sunday, October 26, 2014, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Maxime

 Increasing the flush writers won't help if your disk I/O is not keeping up.

 I've had a look into the log file, below are some remarks:

 1) There are a lot of SSTables on disk for some tables (events for
 example, but not only). I've seen that some compactions are taking up to 32
 SSTables (which corresponds to the default max value for SizeTiered
 compaction).

 2) There is a secondary index that I found suspicious : loc.loc_id_idx. As
 its name implies I have the impression that it's an index on the id of the
 loc, which would lead to almost a 1-1 relationship between the indexed
 value and the original loc. Such an index should be avoided because it does
 not perform well. If it's not an index on the loc_id, please disregard my
 remark

 3) There is a clear imbalance of SSTable count on some nodes. In the log,
 I saw:

 INFO  [STREAM-IN-/...20] 2014-10-25 02:21:43,360
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving *163* files(*4 111 187 195* bytes),
 sending 0 files(0 bytes)

 INFO  [STREAM-IN-/...81] 2014-10-25 02:21:46,121
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving *154* files(*3 332 779 920* bytes),
 sending 0 files(0 bytes)

 INFO  [STREAM-IN-/...71] 2014-10-25 02:21:50,494
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving *1315* files(*4 606 316 933* bytes),
 sending 0 files(0 bytes)

 INFO  [STREAM-IN-/...217] 2014-10-25 02:21:51,036
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving *1640* files(*3 208 023 573* bytes),
 sending 0 files(0 bytes)

  As you can see, the existing 4 nodes are streaming data to the new node
 and on average the data set size is about 3.3 - 4.5 Gb. However the number
 of SSTables is around 150 files for nodes ...20 and
 ...81 but goes through the roof to reach *1315* files for
 ...71 and *1640* files for ...217

  The total data set size is roughly the same but the file number is x10,
 which means that you'll have a bunch of tiny files.

  I guess that upon reception of those files, there will be a massive flush
 to disk, explaining the behaviour you're facing (flush storm)

 I would suggest looking on nodes ...71 and ...217
 to check for the total SSTable count for each table to confirm this
 intuition

 Regards


 On Sun, Oct 26, 2014 at 4:58 PM, Maxime maxim...@gmail.com
 javascript:_e(%7B%7D,'cvml','maxim...@gmail.com'); wrote:

 I've emailed you a raw log file of an instance of this happening.

 I've been monitoring more closely the timing of events in tpstats and the
 logs and I believe this is what is happening:

 - For some reason, C* decides to provoke a flush storm (I say some
 reason, I'm sure there is one but I have had difficulty determining the
 behaviour changes between 1.* and more recent releases).
 - So we see ~ 3000 flush being enqueued.
 - This happens so suddenly that even boosting the number of flush writers
 to 20 does not suffice. I don't even see all time blocked numbers for it
 before C* stops responding. I suspect this is due to the sudden OOM and GC
 occurring.
 - The last tpstat that comes back before the node goes down indicates 20
 active and 3000 pending and the rest 0. It's by far the anomalous activity.

 Is there a way to throttle down this generation of Flush? C* complains if
 I set the queue_size to any value (deprecated now?) and boosting the
 threads does not seem to help since even at 20 we're an order of magnitude
 off.

 Suggestions? Comments?


 On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan doanduy...@gmail.com
 javascript:_e(%7B%7D,'cvml','doanduy...@gmail.com'); wrote:

 Hello Maxime

  Can you put the complete logs and config somewhere ? It would be
 interesting to know what is the cause of the OOM.

 On Sun, Oct 26, 2014 

Re: OOM at Bootstrap Time

2014-10-26 Thread Jonathan Haddad
If the issue is related to I/O, you're going to want to determine if
you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
(queue size) and svctm (service time). The higher those numbers
are, the more overwhelmed your disk is.
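
For example (the device name is a placeholder; the thresholds are rough rules
of thumb, not hard limits):

    iostat -dmx 1 /dev/sda    # one-second samples for the data disk
    # if avgqu-sz stays well above 1 per spindle and %util sits near 100,
    # the disk is likely the bottleneck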

On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com wrote:
 Hello Maxime

 Increasing the flush writers won't help if your disk I/O is not keeping up.

 I've had a look into the log file, below are some remarks:

 1) There are a lot of SSTables on disk for some tables (events for example,
 but not only). I've seen that some compactions are taking up to 32 SSTables
 (which corresponds to the default max value for SizeTiered compaction).

 2) There is a secondary index that I found suspicious : loc.loc_id_idx. As
 its name implies I have the impression that it's an index on the id of the
 loc, which would lead to almost a 1-1 relationship between the indexed value
 and the original loc. Such an index should be avoided because it does not
 perform well. If it's not an index on the loc_id, please disregard my remark

 3) There is a clear imbalance of SSTable count on some nodes. In the log, I
 saw:

 INFO  [STREAM-IN-/...20] 2014-10-25 02:21:43,360
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes), sending 0
 files(0 bytes)

 INFO  [STREAM-IN-/...81] 2014-10-25 02:21:46,121
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes), sending 0
 files(0 bytes)

 INFO  [STREAM-IN-/...71] 2014-10-25 02:21:50,494
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes), sending
 0 files(0 bytes)

 INFO  [STREAM-IN-/...217] 2014-10-25 02:21:51,036
 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
 ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes), sending
 0 files(0 bytes)

  As you can see, the existing 4 nodes are streaming data to the new node and
 on average the data set size is about 3.3 - 4.5 Gb. However the number of
 SSTables is around 150 files for nodes ...20 and
 ...81 but goes through the roof to reach 1315 files for
 ...71 and 1640 files for ...217

  The total data set size is roughly the same but the file number is x10,
 which means that you'll have a bunch of tiny files.

  I guess that upon reception of those files, there will be a massive flush
 to disk, explaining the behaviour you're facing (flush storm)

 I would suggest looking on nodes ...71 and ...217 to
 check for the total SSTable count for each table to confirm this intuition

 Regards


 On Sun, Oct 26, 2014 at 4:58 PM, Maxime maxim...@gmail.com wrote:

 I've emailed you a raw log file of an instance of this happening.

 I've been monitoring more closely the timing of events in tpstats and the
 logs and I believe this is what is happening:

 - For some reason, C* decides to provoke a flush storm (I say some reason,
 I'm sure there is one but I have had difficulty determining the behaviour
 changes between 1.* and more recent releases).
 - So we see ~ 3000 flush being enqueued.
 - This happens so suddenly that even boosting the number of flush writers
 to 20 does not suffice. I don't even see all time blocked numbers for it
 before C* stops responding. I suspect this is due to the sudden OOM and GC
 occurring.
 - The last tpstat that comes back before the node goes down indicates 20
 active and 3000 pending and the rest 0. It's by far the anomalous activity.

 Is there a way to throttle down this generation of Flush? C* complains if
 I set the queue_size to any value (deprecated now?) and boosting the threads
 does not seem to help since even at 20 we're an order of magnitude off.

 Suggestions? Comments?


 On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Maxime

  Can you put the complete logs and config somewhere ? It would be
 interesting to know what is the cause of the OOM.

 On Sun, Oct 26, 2014 at 3:15 AM, Maxime maxim...@gmail.com wrote:

 Thanks a lot that is comforting. We are also small at the moment so I
 definitely can relate with the idea of keeping small and simple at a level
 where it just works.

 I see the new Apache version has a lot of fixes so I will try to upgrade
 before I look into downgrading.


 On Saturday, October 25, 2014, Laing, Michael
 michael.la...@nytimes.com wrote:

 Since no one else has stepped in...

 We have run clusters with ridiculously small nodes - I have a
 production cluster in AWS with 4GB nodes each with 1 CPU and disk-based
 instance storage. It works fine but you can see those little puppies
 struggle...

 And I ran into problems such as you observe...

 Upgrading Java to the 

Re: OOM at Bootstrap Time

2014-10-26 Thread DuyHai Doan
Should doing a major compaction on those nodes lead to a restructuring
of the SSTables? -- Beware of a major compaction on SizeTiered: it will
create 2 giant SSTables, and the expired/outdated/tombstone columns in this
big file will never be cleaned, since the SSTable will never get a chance to
be compacted again

Essentially, to reduce the fragmentation of small SSTables you can stay with
SizeTiered compaction and play around with the compaction properties (the
thresholds) to make C* group a bigger bunch of files each time it compacts, so
that the file count shrinks to a reasonable number.
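
As a hedged illustration of "playing with the thresholds" (the table name and
the values below are placeholders, not recommendations):

    ALTER TABLE mykeyspace.events WITH compaction = {
      'class': 'SizeTieredCompactionStrategy',
      'min_threshold': '4',
      'max_threshold': '64'
    };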

Since you're using C* 2.1 and anti-compaction has been introduced, I
hesitate to advise you to use Leveled compaction as a work-around to reduce
the SSTable count.

 Things are a little bit more complicated because of the incremental repair
process (I don't know whether you're using incremental repair or not in
production). The Dev blog says that Leveled compaction is performed only on
repaired SSTables, the un-repaired ones still use SizeTiered, more details
here: http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1

Regards





On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad j...@jonhaddad.com wrote:

 If the issue is related to I/O, you're going to want to determine if
 you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
 (queue size) and svctm (service time). The higher those numbers
 are, the more overwhelmed your disk is.

 On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com
 wrote:
  Hello Maxime
 
  Increasing the flush writers won't help if your disk I/O is not keeping
 up.
 
  I've had a look into the log file, below are some remarks:
 
  1) There are a lot of SSTables on disk for some tables (events for
 example,
  but not only). I've seen that some compactions are taking up to 32
 SSTables
  (which corresponds to the default max value for SizeTiered compaction).
 
  2) There is a secondary index that I found suspicious : loc.loc_id_idx.
 As
  its name implies I have the impression that it's an index on the id of
 the
  loc, which would lead to almost a 1-1 relationship between the indexed
  value and the original loc. Such an index should be avoided because it does
  not perform well. If it's not an index on the loc_id, please disregard my
  remark
 
  3) There is a clear imbalance of SSTable count on some nodes. In the
 log, I
  saw:
 
  INFO  [STREAM-IN-/...20] 2014-10-25 02:21:43,360
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...81] 2014-10-25 02:21:46,121
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...71] 2014-10-25 02:21:50,494
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes),
 sending
  0 files(0 bytes)
 
  INFO  [STREAM-IN-/...217] 2014-10-25 02:21:51,036
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes),
 sending
  0 files(0 bytes)
 
   As you can see, the existing 4 nodes are streaming data to the new node
 and
  on average the data set size is about 3.3 - 4.5 Gb. However the number of
  SSTables is around 150 files for nodes ...20 and
  ...81 but goes through the roof to reach 1315 files for
  ...71 and 1640 files for ...217
 
   The total data set size is roughly the same but the file number is x10,
  which means that you'll have a bunch of tiny files.
 
   I guess that upon reception of those files, there will be a massive
 flush
  to disk, explaining the behaviour you're facing (flush storm)
 
  I would suggest looking on nodes ...71 and
 ...217 to
  check for the total SSTable count for each table to confirm this
 intuition
 
  Regards
 
 
  On Sun, Oct 26, 2014 at 4:58 PM, Maxime maxim...@gmail.com wrote:
 
  I've emailed you a raw log file of an instance of this happening.
 
  I've been monitoring more closely the timing of events in tpstats and
 the
  logs and I believe this is what is happening:
 
  - For some reason, C* decides to provoke a flush storm (I say some
 reason,
  I'm sure there is one but I have had difficulty determining the
 behaviour
  changes between 1.* and more recent releases).
  - So we see ~ 3000 flush being enqueued.
  - This happens so suddenly that even boosting the number of flush
 writers
  to 20 does not suffice. I don't even see all time blocked numbers for
 it
  before C* stops responding. I suspect this is due to the sudden OOM and
 GC
  occurring.
  - The last tpstat that comes back before the node goes down 

Re: OOM at Bootstrap Time

2014-10-26 Thread Maxime
Hmm, thanks for the reading.

I initially followed some (perhaps too old) maintenance scripts, which
included weekly 'nodetool compact'. Is there a way for me to undo the
damage? Tombstones will be a very important issue for me since the dataset
is very much a rolling dataset using TTLs heavily.

On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Should doing a major compaction on those nodes lead to a restructuring
 of the SSTables? -- Beware of major compaction on SizeTiered: it will
 create 2 giant SSTables, and the expired/outdated/tombstoned columns in these
 big files will never be cleaned, since the SSTables will never get a chance to
 be compacted again.

 Essentially to reduce the fragmentation of small SSTables you can stay
 with SizeTiered compaction and play around with compaction properties (the
 thresholds) to make C* group a bunch of files each time it compacts so that
 the file number shrinks to a reasonable count
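 For example, something along these lines (the table name and the threshold values
 below are only illustrative; tune them to your own data):

   echo "ALTER TABLE myks.events WITH compaction =
         {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 8, 'max_threshold': 64};" | cqlsh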

  Since you're using C* 2.1 and anti-compaction has been introduced, I
  hesitate to advise you to use Leveled compaction as a work-around to reduce
  the SSTable count.

  Things are a little bit more complicated because of the incremental
 repair process (I don't know whether you're using incremental repair or not
 in production). The Dev blog says that Leveled compaction is performed only
 on repaired SSTables, the un-repaired ones still use SizeTiered, more
 details here:
 http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1

 Regards





 On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 If the issue is related to I/O, you're going to want to determine if
 you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
 (queue size) and svctm (service time).  The higher those numbers
 are, the more overwhelmed your disk is.

 On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com
 wrote:
  Hello Maxime
 
  Increasing the flush writers won't help if your disk I/O is not keeping
 up.
 
  I've had a look into the log file, below are some remarks:
 
  1) There are a lot of SSTables on disk for some tables (events for
 example,
  but not only). I've seen that some compactions are taking up to 32
 SSTables
  (which corresponds to the default max value for SizeTiered compaction).
 
  2) There is a secondary index that I found suspicious: loc.loc_id_idx. As
  its name implies, I have the impression that it's an index on the id of the
  loc, which would lead to almost a 1-1 relationship between the indexed value
  and the original loc. Such an index should be avoided because it does not
  perform well. If it's not an index on the loc_id, please disregard my remark.
 
  3) There is a clear imbalance of SSTable count on some nodes. In the
 log, I
  saw:
 
  INFO  [STREAM-IN-/...20] 2014-10-25 02:21:43,360
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...81] 2014-10-25 02:21:46,121
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes),
 sending 0
  files(0 bytes)
 
  INFO  [STREAM-IN-/...71] 2014-10-25 02:21:50,494
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes),
 sending
  0 files(0 bytes)
 
  INFO  [STREAM-IN-/...217] 2014-10-25 02:21:51,036
  StreamResultFuture.java:166 - [Stream
 #a6e54ea0-5bed-11e4-8df5-f357715e1a79
  ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes),
 sending
  0 files(0 bytes)
 
   As you can see, the existing 4 nodes are streaming data to the new
 node and
  on average the data set size is about 3.3 - 4.5 Gb. However the number
 of
  SSTables is around 150 files for nodes ...20 and
  ...81 but goes through the roof to reach 1315 files for
  ...71 and 1640 files for ...217
 
   The total data set size is roughly the same but the file number is x10,
  which means that you'll have a bunch of tiny files.
 
   I guess that upon reception of those files, there will be a massive
 flush
  to disk, explaining the behaviour you're facing (flush storm)
 
  I would suggest looking on nodes ...71 and
 ...217 to
  check for the total SSTable count for each table to confirm this
 intuition
 
  Regards
 
 
  On Sun, Oct 26, 2014 at 4:58 PM, Maxime maxim...@gmail.com wrote:
 
  I've emailed you a raw log file of an instance of this happening.
 
  I've been monitoring more closely the timing of events in tpstats and
 the
  logs and I believe this is what is happening:
 
  - For some reason, C* decides to provoke a flush storm (I say some
 reason,
  I'm sure there is one but I have had difficulty determining the
 behaviour
  

Re: OOM at Bootstrap Time

2014-10-25 Thread Laing, Michael
Since no one else has stepped in...

We have run clusters with ridiculously small nodes - I have a production
cluster in AWS with 4GB nodes each with 1 CPU and disk-based instance
storage. It works fine but you can see those little puppies struggle...

And I ran into problems such as you observe...

Upgrading Java to the latest 1.7 and - most importantly - *reverting to the
default configuration, esp. for heap*, seemed to settle things down
completely. Also make sure that you are using the 'recommended production
settings' from the docs on your boxen.
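A few of the commonly cited OS-level settings, for reference (not an exhaustive list;
check the current docs before applying any of them):

  sudo swapoff -a                         # Cassandra degrades badly once the JVM starts swapping
  sudo sysctl -w vm.max_map_count=131072  # room for the memory-mapped sstable segments
  ulimit -n 100000                        # raise the open-file limit for sstables and sockets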

However we are running 2.0.x not 2.1.0 so YMMV.

And we are switching to 15GB nodes w 2 heftier CPUs each and SSD storage -
still a 'small' machine, but much more reasonable for C*.

However I can't say I am an expert, since I deliberately keep things so
simple that we do not encounter problems - it just works so I dig into
other stuff.

ml


On Sat, Oct 25, 2014 at 5:22 PM, Maxime maxim...@gmail.com wrote:

 Hello, I've been trying to add a new node to my cluster ( 4 nodes ) for a
 few days now.

  I started by adding a node similar to my current configuration, 4 GB of
  RAM + 2 Cores on DigitalOcean. However, every time I would end up getting
 OOM errors after many log entries of the type:

 INFO  [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856
 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap

 leading to:

 ka-120-Data.db (39291 bytes) for commitlog position
 ReplayPosition(segmentId=1414243978538, position=23699418)
 WARN  [SharedPool-Worker-13] 2014-10-25 13:48:18,032
 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread
 Thread[SharedPool-Worker-13,5,main]: {}
 java.lang.OutOfMemoryError: Java heap space

  Thinking it had to do with either compaction or streaming, 2
  activities I've had tremendous issues with in the past, I tried to slow
  down setstreamthroughput to extremely low values, all the way to 5. I
  also tried setting setcompactionthroughput to 0, and then, after reading that in
  some cases unthrottled might be too fast, down to 8. Nothing worked; it merely
  vaguely changed the mean time to OOM, but not in a way indicating either was
  anywhere near a solution.
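  For reference, those knobs were adjusted with nodetool, e.g.:

    nodetool setstreamthroughput 5      # value is in megabits/s
    nodetool setcompactionthroughput 8  # value is in MB/s; 0 means unthrottled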

 The nodes were configured with 2 GB of Heap initially, I tried to crank it
 up to 3 GB, stressing the host memory to its limit.

  After doing some exploration (I am considering writing some Cassandra Ops
  documentation with lessons learned, since there seems to be little of it in
  organized fashion), I read that some people had strange issues on
  lower-end boxes like that, so I bit the bullet and upgraded my new node to
  an 8GB + 4 Core instance, which was anecdotally better.

  To my complete shock, the exact same issues are present, even after raising the heap
  to 6 GB. I figure it can't be a normal situation anymore, but must
 be a bug somehow.

 My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes.
 About 10 CF of varying sizes. Runtime writes are between 300 to 900 /
 second. Cassandra 2.1.0, nothing too wild.

 Has anyone encountered these kinds of issues before? I would really enjoy
 hearing about the experiences of people trying to run small-sized clusters
 like mine. From everything I read, Cassandra operations go very well on
 large (16 GB + 8 Cores) machines, but I'm sad to report I've had nothing
 but trouble trying to run on smaller machines, perhaps I can learn from
 other's experience?

 Full logs can be provided to anyone interested.

 Cheers



Re: OOM at Bootstrap Time

2014-10-25 Thread Maxime
Thanks a lot, that is comforting. We are also small at the moment, so I
can definitely relate to the idea of keeping things small and simple at a level
where it just works.

I see the new Apache version has a lot of fixes so I will try to upgrade
before I look into downgrading.

On Saturday, October 25, 2014, Laing, Michael michael.la...@nytimes.com
wrote:

 Since no one else has stepped in...

 We have run clusters with ridiculously small nodes - I have a production
 cluster in AWS with 4GB nodes each with 1 CPU and disk-based instance
 storage. It works fine but you can see those little puppies struggle...

 And I ran into problems such as you observe...

 Upgrading Java to the latest 1.7 and - most importantly - *reverting to
 the default configuration, esp. for heap*, seemed to settle things down
 completely. Also make sure that you are using the 'recommended production
 settings' from the docs on your boxen.

 However we are running 2.0.x not 2.1.0 so YMMV.

 And we are switching to 15GB nodes w 2 heftier CPUs each and SSD storage -
 still a 'small' machine, but much more reasonable for C*.

 However I can't say I am an expert, since I deliberately keep things so
 simple that we do not encounter problems - it just works so I dig into
 other stuff.

 ml


 On Sat, Oct 25, 2014 at 5:22 PM, Maxime maxim...@gmail.com
 javascript:_e(%7B%7D,'cvml','maxim...@gmail.com'); wrote:

 Hello, I've been trying to add a new node to my cluster ( 4 nodes ) for a
 few days now.

  I started by adding a node similar to my current configuration, 4 GB of
  RAM + 2 Cores on DigitalOcean. However, every time I would end up getting
 OOM errors after many log entries of the type:

 INFO  [SlabPoolCleaner] 2014-10-25 13:44:57,240
 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0
 (0%) off-heap

 leading to:

 ka-120-Data.db (39291 bytes) for commitlog position
 ReplayPosition(segmentId=1414243978538, position=23699418)
 WARN  [SharedPool-Worker-13] 2014-10-25 13:48:18,032
 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread
 Thread[SharedPool-Worker-13,5,main]: {}
 java.lang.OutOfMemoryError: Java heap space

  Thinking it had to do with either compaction or streaming, 2
  activities I've had tremendous issues with in the past, I tried to slow
  down setstreamthroughput to extremely low values, all the way to 5. I
  also tried setting setcompactionthroughput to 0, and then, after reading that in
  some cases unthrottled might be too fast, down to 8. Nothing worked; it merely
  vaguely changed the mean time to OOM, but not in a way indicating either was
  anywhere near a solution.

 The nodes were configured with 2 GB of Heap initially, I tried to crank
 it up to 3 GB, stressing the host memory to its limit.

  After doing some exploration (I am considering writing some Cassandra Ops
  documentation with lessons learned, since there seems to be little of it in
  organized fashion), I read that some people had strange issues on
  lower-end boxes like that, so I bit the bullet and upgraded my new node to
  an 8GB + 4 Core instance, which was anecdotally better.

  To my complete shock, the exact same issues are present, even after raising the
  heap to 6 GB. I figure it can't be a normal situation anymore, but
 must be a bug somehow.

 My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes.
 About 10 CF of varying sizes. Runtime writes are between 300 to 900 /
 second. Cassandra 2.1.0, nothing too wild.

 Has anyone encountered these kinds of issues before? I would really enjoy
 hearing about the experiences of people trying to run small-sized clusters
 like mine. From everything I read, Cassandra operations go very well on
 large (16 GB + 8 Cores) machines, but I'm sad to report I've had nothing
 but trouble trying to run on smaller machines, perhaps I can learn from
 other's experience?

 Full logs can be provided to anyone interested.

 Cheers





Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-13 Thread jivko donev
Graham,
Thanks for the reply. As I stated in my first mail, increasing the heap size
fixes the problem, but I'm more interested in figuring out the right properties
for commitlog and memtable sizes when we need to keep the heap smaller.
Also, I think we are not seeing CASSANDRA-7546, as I applied your patch but the
problem still persists.
What more details do you need? I'll be happy to provide them.


On Wednesday, August 13, 2014 1:05 AM, graham sanderson gra...@vast.com wrote:
 


Agreed, need more details; and just start by increasing heap because that may
well solve the problem.

I have just observed (which makes sense when you think about it) while testing 
fix for https://issues.apache.org/jira/browse/CASSANDRA-7546, that if you are 
replaying a commit log which has a high level of updates for the same partition 
key, you can hit that issue - excess memory allocation under high contention 
for the same partition key - (this might not cause OOM but will certainly 
massively tax GC and it sounds like you don’t have a lot/any headroom).

On Aug 12, 2014, at 12:31 PM, Robert Coli rc...@eventbrite.com wrote:



On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:

We have a node with a commit log directory of ~4G. During start-up of the node, on
commit log replay, the used heap space is constantly growing, ending with an OOM
error.



The heap size and new heap size properties are - 1G and 256M. We are using 
the default settings for commitlog_sync, commitlog_sync_period_in_ms and 
commitlog_segment_size_in_mb.


What version of Cassandra?


1G is tiny for cassandra heap. There is a direct relationship between the data 
in the commitlog and memtables and in the heap. You almost certainly need more 
heap or less commitlog.


=Rob
  

Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread Robert Coli
On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:

 We have a node with a commit log directory of ~4G. During start-up of the node,
  on commit log replay, the used heap space is constantly growing, ending
  with an OOM error.

 The heap size and new heap size properties are - 1G and 256M. We are using
 the default settings for commitlog_sync, commitlog_sync_period_in_ms
 and commitlog_segment_size_in_mb.


What version of Cassandra?

1G is tiny for cassandra heap. There is a direct relationship between the
data in the commitlog and memtables and in the heap. You almost certainly
need more heap or less commitlog.
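For example, the settings that bound that relationship live in cassandra.yaml
(names per the 2.0.x defaults; the values below are only illustrative):

  grep -E 'commitlog_total_space_in_mb|memtable_total_space_in_mb' conf/cassandra.yaml
  # e.g. with a small heap you might pin these down explicitly:
  #   commitlog_total_space_in_mb: 1024
  #   memtable_total_space_in_mb: 128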

=Rob


Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread jivko donev
Hi Robert,

Thanks for your reply. The Cassandra version is 2.0.7. Is there some commonly
used rule for determining the commitlog and memtable sizes depending on the
heap size? What would be the main disadvantage of having a smaller commitlog?


On Tuesday, August 12, 2014 8:32 PM, Robert Coli rc...@eventbrite.com wrote:
 




On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:

We have a node with a commit log directory of ~4G. During start-up of the node, on
commit log replay, the used heap space is constantly growing, ending with an OOM
error.



The heap size and new heap size properties are - 1G and 256M. We are using the 
default settings for commitlog_sync, commitlog_sync_period_in_ms and 
commitlog_segment_size_in_mb.

What version of Cassandra?

1G is tiny for cassandra heap. There is a direct relationship between the data 
in the commitlog and memtables and in the heap. You almost certainly need more 
heap or less commitlog.

=Rob

Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread graham sanderson
Agreed, need more details; and just start by increasing heap because that may
well solve the problem.

I have just observed (which makes sense when you think about it) while testing 
fix for https://issues.apache.org/jira/browse/CASSANDRA-7546, that if you are 
replaying a commit log which has a high level of updates for the same partition 
key, you can hit that issue - excess memory allocation under high contention 
for the same partition key - (this might not cause OOM but will certainly 
massively tax GC and it sounds like you don’t have a lot/any headroom).

On Aug 12, 2014, at 12:31 PM, Robert Coli rc...@eventbrite.com wrote:

 
 On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:
 We have a node with a commit log directory of ~4G. During start-up of the node, on
 commit log replay, the used heap space is constantly growing, ending with
 an OOM error.
 
 The heap size and new heap size properties are - 1G and 256M. We are using 
 the default settings for commitlog_sync, commitlog_sync_period_in_ms and 
 commitlog_segment_size_in_mb.
 
 What version of Cassandra?
 
 1G is tiny for cassandra heap. There is a direct relationship between the 
 data in the commitlog and memtables and in the heap. You almost certainly 
 need more heap or less commitlog.
 
 =Rob
   





Re: OOM while performing major compaction

2014-02-27 Thread Edward Capriolo
One big downside of major compaction is that (depending on your
cassandra version) the bloom filter size is pre-calculated. Thus cassandra
needs enough heap for your existing 33k+ sstables and the new large
compacted one. In the past this happened to us when the compaction thread
got hung up and the sstables grew. When this happens I delete the data
directory and bring in a fresh node. The time to recover from that sstable
build-up can be huge, even with multi-threaded compaction.


On Thu, Feb 27, 2014 at 2:09 PM, Nish garg pipeli...@gmail.com wrote:

 I am having an OOM during major compaction on one of the column families where
  there are a lot of SSTables (33000) to be compacted. Is there any other way
  for them to be compacted? Any help will be really appreciated.

 Here are the details

  /opt/cassandra/current/bin/nodetool -h us1emscsm-01  compact tomcat
 sessions
 Error occurred during compaction
 java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java
 heap space
  at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
  at java.util.concurrent.FutureTask.get(FutureTask.java:83)
  at
 org.apache.cassandra.db.compaction.CompactionManager.performMaximal(CompactionManager.java:334)
  at
 org.apache.cassandra.db.ColumnFamilyStore.forceMajorCompaction(ColumnFamilyStore.java:1691)
  at
 org.apache.cassandra.service.StorageService.forceTableCompaction(StorageService.java:2168)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
  at
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
  at
 com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
  at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
  at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
  at
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
  at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
  at
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
  at
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
  at
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
  at
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
  at
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
  at sun.rmi.transport.Transport$1.run(Transport.java:159)
  at java.security.AccessController.doPrivileged(Native Method)
  at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
  at
 sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
  at
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
  at
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.OutOfMemoryError: Java heap space
  at
 org.apache.cassandra.io.util.RandomAccessReader.init(RandomAccessReader.java:77)
  at
 org.apache.cassandra.io.compress.CompressedRandomAccessReader.init(CompressedRandomAccessReader.java:75)
  at
 org.apache.cassandra.io.compress.CompressedThrottledReader.init(CompressedThrottledReader.java:38)
  at
 org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
  at
 org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1212)
  at
 org.apache.cassandra.io.sstable.SSTableScanner.init(SSTableScanner.java:54)
  at
 org.apache.cassandra.io.sstable.SSTableReader.getDirectScanner(SSTableReader.java:1032)
  at
 org.apache.cassandra.io.sstable.SSTableReader.getDirectScanner(SSTableReader.java:1044)
  at
 org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:157)
  at
 org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:163)
  at
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:117)
  at
 

Re: OOM while performing major compaction

2014-02-27 Thread Robert Coli
On Thu, Feb 27, 2014 at 11:09 AM, Nish garg pipeli...@gmail.com wrote:

 I am having an OOM during major compaction on one of the column families where
  there are a lot of SSTables (33000) to be compacted. Is there any other way
  for them to be compacted? Any help will be really appreciated.


You can use user defined compaction to reduce the working set, but only a
major compaction is capable of purging 100% of tombstones.
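User defined compaction is driven over JMX rather than nodetool in this era; roughly,
with a JMX client such as jmxterm it looks like the following (the file names are made up,
and the exact argument format of forceUserDefinedCompaction differs between versions,
so check the MBean signature in jconsole first):

  java -jar jmxterm-uber.jar -l localhost:7199
  > bean org.apache.cassandra.db:type=CompactionManager
  > run forceUserDefinedCompaction myks myks-mycf-ic-01234-Data.db,myks-mycf-ic-01235-Data.db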

How much garbage is actually in the files? Why do you have 33,000 of them?
You mention a major compaction so you are likely not using LCS with the bad
5mb default... how did you end up with so many SSTables?

Have you removed the throttle from compaction, generally?

What version of Cassandra?

=Rob


Re: OOM while performing major compaction

2014-02-27 Thread Nish garg
Thanks for replying.

We are on Cassandra 1.2.9.

We have a time-series-like data structure where we need to keep only the last 6
hours of data. So we expire data using an expireddatetime column on the column
family, and then we run an expire script via cron to create tombstones. We
don't use TTLs yet and are planning to use them in a future release. Hopefully that
will fix some of the issues caused by the expire script, as it needs to read the
data first before creating tombstones.

So to answer your question, we have almost 80% tombstones in those
sstables. (There is no easy way to confirm this unless I convert all those
33000 sstables to JSON files and query them for tombstones.)
The reason for the 33000 of them may be that machine load was too high for minor
compaction and it was falling behind, or something happened to the minor
compaction thread on this node. The other two nodes in this cluster are fine.
Yes, we are using the size-tiered compaction strategy.

I am inclined towards 'decommission and bootstrap' for this node, as it seems
like performing a major compaction on this node is impossible.

However, I am still looking for other solutions...
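For spot-checking the tombstone ratio, sstable2json ships with 1.2.x; dumping one or two
of the largest files is usually enough to get a feel for it (the path below is made up):

  sstable2json /var/lib/cassandra/data/myks/mycf/myks-mycf-ic-12345-Data.db > /tmp/mycf-12345.json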




On Thu, Feb 27, 2014 at 4:03 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Feb 27, 2014 at 11:09 AM, Nish garg pipeli...@gmail.com wrote:

 I am having an OOM during major compaction on one of the column families where
  there are a lot of SSTables (33000) to be compacted. Is there any other way
  for them to be compacted? Any help will be really appreciated.


 You can use user defined compaction to reduce the working set, but only a
 major compaction is capable of purging 100% of tombstones.

 How much garbage is actually in the files? Why do you have 33,000 of them?
 You mention a major compaction so you are likely not using LCS with the bad
 5mb default... how did you end up with so many SSTables?

 Have you removed the throttle from compaction, generally?

 What version of Cassandra?

 =Rob




Re: OOM while performing major compaction

2014-02-27 Thread Tupshin Harper
If you can programmatically roll over onto a new column family every 6
hours (or every day or other reasonable increment), and then just drop your
existing column family after all the columns would have been expired, you
could skip your compaction entirely. It was not clear to me from your
description whether *all* of the data only needs to be retained for 6
hours. If that is true, rolling over to a new cf will be your simplest
option.

-Tupshin


On Thu, Feb 27, 2014 at 5:31 PM, Nish garg pipeli...@gmail.com wrote:

 Thanks for replying.

 We are  on Cassandra 1.2.9.

 We have time series like data structure where we need to keep only last 6
 hours of data. So we expire data using  expireddatetime column on column
 family and then we run expire script via cron to create tombstones. We
 don't use ttl yet and planning to use it in our future release. Hope that
 will fix some of the issues caused by expire script as it needs to read the
 data first before creating tombstones.

 So to answer your question, we have almost 80% of tombstones in those
 sstables. (There is no easy way to confirm this unless I convert all those
 33000 sstables to JSON file and query them for tombstones).
 The reason of 33000 of them may be due to machine load too high for minor
 compaction and it was falling behind or some thing happened to minor
 compaction thread on this node. Other two nodes in this cluster are fine.
 Yes, we are using sized compaction strategy.

 I am inclined towards 'decommission and bootstrap' this node as it seems
 like performing major compaction on this node is impossible.

 However still looking for other solutions...




 On Thu, Feb 27, 2014 at 4:03 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Feb 27, 2014 at 11:09 AM, Nish garg pipeli...@gmail.com wrote:

 I am having an OOM during major compaction on one of the column families
  where there are a lot of SSTables (33000) to be compacted. Is there any other
  way for them to be compacted? Any help will be really appreciated.


 You can use user defined compaction to reduce the working set, but only a
 major compaction is capable of purging 100% of tombstones.

 How much garbage is actually in the files? Why do you have 33,000 of
 them? You mention a major compaction so you are likely not using LCS with
 the bad 5mb default... how did you end up with so many SSTables?

 Have you removed the throttle from compaction, generally?

 What version of Cassandra?

 =Rob





Re: OOM while performing major compaction

2014-02-27 Thread Nish garg
Hello Tupshin,

Yes, all the data needs to be kept for just the last 6 hours. Yes, changing to a
new CF every 6 hours solves the compaction issue, but right after the change we
will have less than 6 hours of data. We can use CF1 and CF2 and truncate
them one at a time every 6 hours in a loop, but we need some kind of view that
does (CF1 union CF2) to get the final data. Unfortunately views are not
supported in Cassandra. Maybe we can change our code to see 2 CFs all the
time... it is kind of a hack and does not seem to be a perfect solution.
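A rough sketch of the two-CF rotation, with made-up names, driven from cron every 6
hours; the application writes to the current bucket and reads always merge cf1 and cf2:

  # pick the bucket for the 6-hour window we are entering and clear its stale contents,
  # then point writers at it; readers keep querying cf1 and cf2 and merge client-side
  NEXT=cf$(( ($(date +%s) / 21600) % 2 + 1 ))
  echo "TRUNCATE myks.$NEXT;" | cqlsh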


On Thu, Feb 27, 2014 at 4:49 PM, Tupshin Harper tups...@tupshin.com wrote:

 If you can programmatically roll over onto a new column family every 6
 hours (or every day or other reasonable increment), and then just drop your
 existing column family after all the columns would have been expired, you
 could skip your compaction entirely. It was not clear to me from your
 description whether *all* of the data only needs to be retained for 6
 hours. If that is true, rolling over to a new cf will be your simplest
 option.

 -Tupshin


 On Thu, Feb 27, 2014 at 5:31 PM, Nish garg pipeli...@gmail.com wrote:

 Thanks for replying.

 We are  on Cassandra 1.2.9.

 We have time series like data structure where we need to keep only last 6
 hours of data. So we expire data using  expireddatetime column on column
 family and then we run expire script via cron to create tombstones. We
 don't use ttl yet and planning to use it in our future release. Hope that
 will fix some of the issues caused by expire script as it needs to read the
 data first before creating tombstones.

 So to answer your question, we have almost 80% of tombstones in those
 sstables. (There is no easy way to confirm this unless I convert all those
 33000 sstables to JSON file and query them for tombstones).
 The reason of 33000 of them may be due to machine load too high for minor
 compaction and it was falling behind or some thing happened to minor
 compaction thread on this node. Other two nodes in this cluster are fine.
 Yes, we are using sized compaction strategy.

 I am inclined towards 'decommission and bootstrap' this node as it seems
 like performing major compaction on this node is impossible.

 However still looking for other solutions...




 On Thu, Feb 27, 2014 at 4:03 PM, Robert Coli rc...@eventbrite.comwrote:

 On Thu, Feb 27, 2014 at 11:09 AM, Nish garg pipeli...@gmail.com wrote:

 I am having an OOM during major compaction on one of the column families
  where there are a lot of SSTables (33000) to be compacted. Is there any other
  way for them to be compacted? Any help will be really appreciated.


 You can use user defined compaction to reduce the working set, but only
 a major compaction is capable of purging 100% of tombstones.

 How much garbage is actually in the files? Why do you have 33,000 of
 them? You mention a major compaction so you are likely not using LCS with
 the bad 5mb default... how did you end up with so many SSTables?

 Have you removed the throttle from compaction, generally?

 What version of Cassandra?

 =Rob






Re: OOM while performing major compaction

2014-02-27 Thread Tupshin Harper
You are right that modifying your code to access two CFs is a hack, and not
an ideal solution, but I think it should be pretty easy to implement, and
would help you get out of this jam pretty quickly. Not saying you should go
down that path, but if you lack better options, that would probably be my
choice.

-Tupshin


On Thu, Feb 27, 2014 at 6:03 PM, Nish garg pipeli...@gmail.com wrote:

 Hello Tupshin,

 Yes all the data needs to be kept for just last 6 hours. Yes changing to
 new CF every 6 hours solves the compaction issue, but between the change we
 will have less than 6 hours of data. We can use CF1 and CF2 and truncate
 them one at a time every 6 hours in loop but we need some kind of view that
 does (CF1 union CF2) to get final data. Unfortunately views are not
 supported in Cassandra. May be we can change our code to see 2 CFs all the
 time...it is kind of hack and does not seem to be perfect solution.


 On Thu, Feb 27, 2014 at 4:49 PM, Tupshin Harper tups...@tupshin.comwrote:

 If you can programmatically roll over onto a new column family every 6
 hours (or every day or other reasonable increment), and then just drop your
 existing column family after all the columns would have been expired, you
 could skip your compaction entirely. It was not clear to me from your
 description whether *all* of the data only needs to be retained for 6
 hours. If that is true, rolling over to a new cf will be your simplest
 option.

 -Tupshin


 On Thu, Feb 27, 2014 at 5:31 PM, Nish garg pipeli...@gmail.com wrote:

 Thanks for replying.

 We are  on Cassandra 1.2.9.

 We have time series like data structure where we need to keep only last
 6 hours of data. So we expire data using  expireddatetime column on column
 family and then we run expire script via cron to create tombstones. We
 don't use ttl yet and planning to use it in our future release. Hope that
 will fix some of the issues caused by expire script as it needs to read the
 data first before creating tombstones.

 So to answer your question, we have almost 80% of tombstones in those
 sstables. (There is no easy way to confirm this unless I convert all those
 33000 sstables to JSON file and query them for tombstones).
 The reason of 33000 of them may be due to machine load too high for
 minor compaction and it was falling behind or some thing happened to minor
 compaction thread on this node. Other two nodes in this cluster are fine.
 Yes, we are using sized compaction strategy.

 I am inclined towards 'decommission and bootstrap' this node as it seems
 like performing major compaction on this node is impossible.

 However still looking for other solutions...




 On Thu, Feb 27, 2014 at 4:03 PM, Robert Coli rc...@eventbrite.comwrote:

 On Thu, Feb 27, 2014 at 11:09 AM, Nish garg pipeli...@gmail.comwrote:

 I am having an OOM during major compaction on one of the column families
  where there are a lot of SSTables (33000) to be compacted. Is there any other
  way for them to be compacted? Any help will be really appreciated.


 You can use user defined compaction to reduce the working set, but only
 a major compaction is capable of purging 100% of tombstones.

 How much garbage is actually in the files? Why do you have 33,000 of
 them? You mention a major compaction so you are likely not using LCS with
 the bad 5mb default... how did you end up with so many SSTables?

 Have you removed the throttle from compaction, generally?

 What version of Cassandra?

 =Rob







Re: OOM after some days related to RunnableScheduledFuture and meter persistance

2014-01-08 Thread Tyler Hobbs
I believe this is https://issues.apache.org/jira/browse/CASSANDRA-6358,
which was fixed in 2.0.3.


On Wed, Jan 8, 2014 at 7:15 AM, Desimpel, Ignace ignace.desim...@nuance.com
 wrote:

  Hi,



 On linux and cassandra version 2.0.2 I had an OOM after a heavy load and
 then some (15 ) days of idle running (not exactly idle but very very low
 activity).

 Two out of a 4 machine cluster had this OOM.



 I checked the heap dump (9GB) and that tells me :



  One instance of *java.util.concurrent.ScheduledThreadPoolExecutor* loaded by *system
  class loader* occupies *8.927.175.368 (94,53%)* bytes. The instance is
 referenced by *org.apache.cassandra.io.sstable.SSTableReader @
 0x7fadf89e0* , loaded by *sun.misc.Launcher$AppClassLoader @
 0x683e6ad30*. The memory is accumulated in one instance of
 *java.util.concurrent.RunnableScheduledFuture[]* loaded by *system
 class loader*.



 So I checked the SSTableReader instance and found out the
 ‘ScheduledThreadPoolExecutor syncExecutor ‘ object is holding about 600k of
 ScheduledFutureTasks.

 According to the code on SSTableReader these tasks must have been created
 by the code line syncExecutor.scheduleAtFixedRate. That means that none of
 these tasks ever get scheduled because some (and only one) initial task is
 probably blocking.

 But then again, the one thread to execute these tasks, seems to be in a
 ‘normal’ state (at time of OOM) and is executing with a stack trace pasted
 below :



 Thread 0x696777eb8

   at
 org.apache.cassandra.db.AtomicSortedColumns$1.create(Lorg/apache/cassandra/config/CFMetaData;Z)Lorg/apache/cassandra/db/AtomicSortedColumns;
 (AtomicSortedColumns.java:58)

   at
 org.apache.cassandra.db.AtomicSortedColumns$1.create(Lorg/apache/cassandra/config/CFMetaData;Z)Lorg/apache/cassandra/db/ColumnFamily;
 (AtomicSortedColumns.java:55)

   at
 org.apache.cassandra.db.ColumnFamily.cloneMeShallow(Lorg/apache/cassandra/db/ColumnFamily$Factory;Z)Lorg/apache/cassandra/db/ColumnFamily;
 (ColumnFamily.java:70)

   at
 org.apache.cassandra.db.Memtable.resolve(Lorg/apache/cassandra/db/DecoratedKey;Lorg/apache/cassandra/db/ColumnFamily;Lorg/apache/cassandra/db/index/SecondaryIndexManager$Updater;)V
 (Memtable.java:187)

   at
 org.apache.cassandra.db.Memtable.put(Lorg/apache/cassandra/db/DecoratedKey;Lorg/apache/cassandra/db/ColumnFamily;Lorg/apache/cassandra/db/index/SecondaryIndexManager$Updater;)V
 (Memtable.java:158)

   at
 org.apache.cassandra.db.ColumnFamilyStore.apply(Lorg/apache/cassandra/db/DecoratedKey;Lorg/apache/cassandra/db/ColumnFamily;Lorg/apache/cassandra/db/index/SecondaryIndexManager$Updater;)V
 (ColumnFamilyStore.java:840)

   at
 org.apache.cassandra.db.Keyspace.apply(Lorg/apache/cassandra/db/RowMutation;ZZ)V
 (Keyspace.java:373)

   at
 org.apache.cassandra.db.Keyspace.apply(Lorg/apache/cassandra/db/RowMutation;Z)V
 (Keyspace.java:338)

   at org.apache.cassandra.db.RowMutation.apply()V (RowMutation.java:201)

   at
 org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(Lorg/apache/cassandra/service/QueryState;)Lorg/apache/cassandra/transport/messages/ResultMessage;
 (ModificationStatement.java:477)

   at
 org.apache.cassandra.cql3.QueryProcessor.processInternal(Ljava/lang/String;)Lorg/apache/cassandra/cql3/UntypedResultSet;
 (QueryProcessor.java:178)

   at
 org.apache.cassandra.db.SystemKeyspace.persistSSTableReadMeter(Ljava/lang/String;Ljava/lang/String;ILorg/apache/cassandra/metrics/RestorableMeter;)V
 (SystemKeyspace.java:938)

   at org.apache.cassandra.io.sstable.SSTableReader$2.run()V
 (SSTableReader.java:342)

   at
 java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;
 (Executors.java:471)

   at java.util.concurrent.FutureTask.runAndReset()Z (FutureTask.java:304)

   at
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Ljava/util/concurrent/ScheduledThreadPoolExecutor$ScheduledFutureTask;)Z
 (ScheduledThreadPoolExecutor.java:178)

   at
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V
 (ScheduledThreadPoolExecutor.java:293)

   at
 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
 (ThreadPoolExecutor.java:1145)

   at java.util.concurrent.ThreadPoolExecutor$Worker.run()V
 (ThreadPoolExecutor.java:615)

   at java.lang.Thread.run()V (Thread.java:724)





  Since each of these tasks is throttled by meterSyncThrottle.acquire(), I
  suspect that the RateLimiter is causing a delay. The RateLimiter instance
  attributes are:

 Type|Name|Value

 long|nextFreeTicketMicros|3016022567383

 double|maxPermits|100.0

 double|storedPermits|99.0

 long|offsetNanos|334676357831746



 I guess that these attributes will practically result in a blocking
 behavior, resulting in the OOM …



 Is there someone that can make sense out of it?

 I hope this helps in finding out what the reason is for this and maybe
 could be avoided in the future. I still have the heap dump, so 

Re: OOM while reading key cache

2013-11-14 Thread olek.stas...@gmail.com
Yes, as I wrote in my first e-mail.  When I removed the key cache file,
cassandra started without further problems.
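Concretely, with the node stopped, that amounted to something like the following (the
path assumes a default install; adjust to your saved_caches_directory setting):

  rm /var/lib/cassandra/saved_caches/*KeyCache*.db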
regards
Olek

2013/11/13 Robert Coli rc...@eventbrite.com:

 On Wed, Nov 13, 2013 at 12:35 AM, Tom van den Berge t...@drillster.com
 wrote:

 I'm having the same problem, after upgrading from 1.2.3 to 1.2.10.

 I can remember this was a bug that was solved in the 1.0 or 1.1 version
 some time ago, but apparently it got back.
 A workaround is to delete the contents of the saved_caches directory
 before starting up.


 Yours is not the first report of this I've heard resulting from a 1.2.x to
 1.2.x upgrade. Reports are of the form "I had to nuke my saved_caches or I
 couldn't start my node, it OOMed", etc.

 https://issues.apache.org/jira/browse/CASSANDRA-6325

 Exists, but doesn't seem  to be the same issue.

 https://issues.apache.org/jira/browse/CASSANDRA-5986

 Similar, doesn't seem to be an issue triggered by upgrade..

 If I were one of the posters on this thread, I would strongly consider
 filing a JIRA on point.

 @OP (olek) : did removing the saved_caches also fix your problem?

 =Rob




Re: OOM while reading key cache

2013-11-14 Thread Fabien Rousseau
A few month ago, we've got a similar issue on 1.2.6 :
https://issues.apache.org/jira/browse/CASSANDRA-5706

But it has been fixed and did not encountered this issue anymore (we're
also on 1.2.10)


2013/11/14 olek.stas...@gmail.com olek.stas...@gmail.com

  Yes, as I wrote in my first e-mail.  When I removed the key cache file,
  cassandra started without further problems.
 regards
 Olek

 2013/11/13 Robert Coli rc...@eventbrite.com:
 
  On Wed, Nov 13, 2013 at 12:35 AM, Tom van den Berge t...@drillster.com
  wrote:
 
  I'm having the same problem, after upgrading from 1.2.3 to 1.2.10.
 
  I can remember this was a bug that was solved in the 1.0 or 1.1 version
  some time ago, but apparently it got back.
  A workaround is to delete the contents of the saved_caches directory
  before starting up.
 
 
  Yours is not the first report of this I've heard resulting from a 1.2.x to
  1.2.x upgrade. Reports are of the form "I had to nuke my saved_caches or I
  couldn't start my node, it OOMed", etc.
 
  https://issues.apache.org/jira/browse/CASSANDRA-6325
 
  Exists, but doesn't seem  to be the same issue.
 
  https://issues.apache.org/jira/browse/CASSANDRA-5986
 
  Similar, doesn't seem to be an issue triggered by upgrade..
 
  If I were one of the posters on this thread, I would strongly consider
  filing a JIRA on point.
 
  @OP (olek) : did removing the saved_caches also fix your problem?
 
  =Rob
 
 




-- 
Fabien Rousseau


 aur...@yakaz.com | www.yakaz.com


Re: OOM while reading key cache

2013-11-13 Thread Robert Coli
On Wed, Nov 13, 2013 at 12:35 AM, Tom van den Berge t...@drillster.comwrote:

 I'm having the same problem, after upgrading from 1.2.3 to 1.2.10.

 I can remember this was a bug that was solved in the 1.0 or 1.1 version
 some time ago, but apparently it got back.
 A workaround is to delete the contents of the saved_caches directory
 before starting up.


Yours is not the first report of this I've heard resulting from a 1.2.x to
1.2.x upgrade. Reports are of the form "I had to nuke my saved_caches or I
couldn't start my node, it OOMed", etc.

https://issues.apache.org/jira/browse/CASSANDRA-6325

Exists, but doesn't seem  to be the same issue.

https://issues.apache.org/jira/browse/CASSANDRA-5986

Similar, doesn't seem to be an issue triggered by upgrade..

If I were one of the posters on this thread, I would strongly consider
filing a JIRA on point.

@OP (olek) : did removing the saved_caches also fix your problem?

=Rob


Re: OOM while reading key cache

2013-11-11 Thread Aaron Morton
 -6 machines with 8GB RAM each and three 150GB disks each
 -default heap configuration
With 8GB the default heap is 2GB; try kicking that up to 4GB and a 600 to 800
MB new heap.

I would guess that for the data load you have, 2GB is not enough.
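In conf/cassandra-env.sh that would be something like (adjust after watching GC behaviour):

  MAX_HEAP_SIZE="4G"
  HEAP_NEWSIZE="800M"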

hope that helps. 

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder  Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 8/11/2013, at 11:31 pm, olek.stas...@gmail.com wrote:

 Hello,
 I'm facing an OOM on reading the key_cache.
 The cluster conf is as follows:
 -6 machines with 8GB RAM each and three 150GB disks each
 -default heap configuration
 -default key cache configuration
 -the biggest keyspace is about 500GB in size (RF: 2, so in fact there is
 250GB of raw data).
 
 After upgrading the first of the machines from 1.2.11 to 2.0.2 I've received the
 error:
 INFO [main] 2013-11-08 10:53:16,716 AutoSavingCache.java (line 114)
 reading saved cache
 /home/synat/nosql_filesystem/cassandra/data/saved_caches/production_storage-METADATA-KeyCache-b.db
 ERROR [main] 2013-11-08 10:53:16,895 CassandraDaemon.java (line 478)
 Exception encountered during startup
 java.lang.OutOfMemoryError: Java heap space
at 
 org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:394)
at 
 org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355)
at 
 org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:352)
at 
 org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
at 
 org.apache.cassandra.db.ColumnFamilyStore.init(ColumnFamilyStore.java:264)
at 
 org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:409)
at 
 org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:381)
at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:314)
at org.apache.cassandra.db.Keyspace.init(Keyspace.java:268)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
at 
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274)
at 
 org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
at 
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
 
 
  The error appears on every start, so I've decided to disable the key cache (this
  was not helpful) and temporarily moved the key cache out of the cache folder
  (the file was 13M in size). That helps in starting the node, but this is only a
  workaround and it's not the desired configuration. Does anyone have any idea
  what the real cause of the OOM problem is?
  best regards
  Aleksander
  ps. I still have 5 nodes to upgrade; I'll report if the problem appears on the rest.



Re: OOM on replaying CommitLog with Cassandra 2.0.0

2013-11-07 Thread Fabian Seifert
Thanks, I missed that issue, but it solved our problems.


Regards

Fabian






From: Robert Coli
Sent: Tuesday, 5 November 2013 19:12
To: user@cassandra.apache.org

On Tue, Nov 5, 2013 at 12:06 AM, Fabian Seifert fabian.seif...@frischmann.biz wrote:

 It keeps crashing with OOM on CommitLog replay:

https://issues.apache.org/jira/browse/CASSANDRA-6087

Probably this issue, fixed in 2.0.2.

=Rob

Re: OOM on replaying CommitLog with Cassandra 2.0.0

2013-11-05 Thread Robert Coli
On Tue, Nov 5, 2013 at 12:06 AM, Fabian Seifert 
fabian.seif...@frischmann.biz wrote:

  It keeps crashing with OOM on CommitLog replay:


https://issues.apache.org/jira/browse/CASSANDRA-6087

Probably this issue, fixed in 2.0.2.

=Rob


Re: OOM when applying migrations

2012-09-20 Thread Jason Wee
Hi, when the heap goes above 70% usage, you should be able to see
in the log many flushes, or the row cache size being reduced. Did you
restart the cassandra daemon on the node that threw the OOM?

On Thu, Sep 20, 2012 at 9:11 PM, Vanger disc...@gmail.com wrote:

  Hello,
 We are trying to add new nodes to our *6-node* cassandra cluster with
 RF=3 cassandra version 1.0.11. We are *adding 18 new nodes* one-by-one.

  The first strange thing I've noticed is that the number of completed
  MigrationStage tasks in nodetool tpstats grows for every new node, while the schema
  is not changed. Now, with a 21-node ring, the final join shows 184683
  migrations, while with 7 nodes it was about 50k migrations.
  In fact it seems that this number is not the number of applied migrations.
 When i grep log file with
 grep Applying migration /var/log/cassandra/system.log -c
 For each new node result is pretty much the same - around 7500 Applying
 migration found in log.

 And the real problem is that now new nodes fail with Out Of Memory while
 building schema from migrations. In logs we can find the following:

 WARN [ScheduledTasks:1] 2012-09-19 18:51:22,497 GCInspector.java (line
 145) Heap is 0.7712290960125684 full.  You may need to reduce memtable
 and/or cache sizes.  Cassandra will now flush up to the two largest
 memtables to free up memory.  Adjust flush_largest_memtables_at threshold
 in cassandra.yaml if you don't want Cassandra to do this automatically
  INFO [ScheduledTasks:1] 2012-09-19 18:51:22,498 StorageService.java (line
 2658) Unable to reduce heap usage since there are no dirty column families
 
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 GCInspector.java (line
 139) Heap is 0.853078131310858 full.  You may need to reduce memtable
 and/or cache sizes.  Cassandra is now reducing cache sizes to free up
 memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you
 don't want Cassandra to do this automatically
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing AppUser RowCache capacity from 10 to 0 to reduce
 memory pressure
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing AppUser KeyCache capacity from 10 to 0 to reduce
 memory pressure
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing PaymentClaim KeyCache capacity from 5 to 0 to
 reduce memory pressure
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing Organization RowCache capacity from 1000 to 0 to reduce
 memory pressure
  .
  INFO [main] 2012-09-19 18:57:14,181 StorageService.java (line 668)
 JOINING: waiting for schema information to complete
 ERROR [Thread-28] 2012-09-19 18:57:14,198 AbstractCassandraDaemon.java
 (line 139) Fatal exception in thread Thread[Thread-28,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:140)
 at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:115)
 ...
 ERROR [ReadStage:353] 2012-09-19 18:57:20,453 AbstractCassandraDaemon.java
 (line 139) Fatal exception in thread Thread[ReadStage:353,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.service.MigrationManager.makeColumns(MigrationManager.java:256)
 at
 org.apache.cassandra.db.DefinitionsUpdateVerbHandler.doVerb(DefinitionsUpdateVerbHandler.java:51)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)


 Originally max heap size was set to 6G. Then we increased heap size
 limit to 8G and it works. But warnings still present

  WARN [ScheduledTasks:1] 2012-09-20 11:39:11,373 GCInspector.java (line
 145) Heap is 0.7760745735786222 full.  You may need to reduce memtable
 and/or cache sizes.  Cassandra will now flush up to the two largest
 memtables to free up memory.  Adjust flush_largest_memtables_at threshold
 in cassandra.yaml if you don't want Cassandra to do this automatically
  INFO [ScheduledTasks:1] 2012-09-20 11:39:11,374 StorageService.java (line
 2658) Unable to reduce heap usage since there are no dirty column families

  It is probably a bug in applying migrations.
  Could anyone explain why cassandra behaves this way? Could you please
  recommend us something to cope with this situation?
 Thank you in advance.

 --
 W/ best regards,
 Sergey B.




Re: OOM when applying migrations

2012-09-20 Thread Tyler Hobbs
This should explain the schema issue in 1.0 that has been fixed in 1.1:
http://www.datastax.com/dev/blog/the-schema-management-renaissance

On Thu, Sep 20, 2012 at 10:17 AM, Jason Wee peich...@gmail.com wrote:

  Hi, when the heap goes above 70% usage, you should be able to see
  in the log many flushes, or the row cache size being reduced. Did you
  restart the cassandra daemon on the node that threw the OOM?


 On Thu, Sep 20, 2012 at 9:11 PM, Vanger disc...@gmail.com wrote:

  Hello,
 We are trying to add new nodes to our *6-node* cassandra cluster with
 RF=3 cassandra version 1.0.11. We are *adding 18 new nodes* one-by-one.

 First strange thing, I've noticed, is the number of completed
 MigrationStage in nodetool tpstats grows for every new node, while schema
 is not changed. For now with 21-nodes ring, for final join it shows 184683
 migrations, while with 7-nodes it was about 50k migrations.
 In fact it seems that this number is not a number of applied migrations.
 When i grep log file with
 grep Applying migration /var/log/cassandra/system.log -c
 For each new node result is pretty much the same - around 7500 Applying
 migration found in log.

 And the real problem is that now new nodes fail with Out Of Memory while
 building schema from migrations. In logs we can find the following:

 WARN [ScheduledTasks:1] 2012-09-19 18:51:22,497 GCInspector.java (line
 145) Heap is 0.7712290960125684 full.  You may need to reduce memtable
 and/or cache sizes.  Cassandra will now flush up to the two largest
 memtables to free up memory.  Adjust flush_largest_memtables_at threshold
 in cassandra.yaml if you don't want Cassandra to do this automatically
  INFO [ScheduledTasks:1] 2012-09-19 18:51:22,498 StorageService.java
 (line 2658) Unable to reduce heap usage since there are no dirty column
 families
 
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 GCInspector.java (line
 139) Heap is 0.853078131310858 full.  You may need to reduce memtable
 and/or cache sizes.  Cassandra is now reducing cache sizes to free up
 memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you
 don't want Cassandra to do this automatically
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing AppUser RowCache capacity from 10 to 0 to reduce
 memory pressure
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing AppUser KeyCache capacity from 10 to 0 to reduce
 memory pressure
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing PaymentClaim KeyCache capacity from 5 to 0 to
 reduce memory pressure
  WARN [ScheduledTasks:1] 2012-09-19 18:51:29,500 AutoSavingCache.java
 (line 187) Reducing Organization RowCache capacity from 1000 to 0 to reduce
 memory pressure
  .
  INFO [main] 2012-09-19 18:57:14,181 StorageService.java (line 668)
 JOINING: waiting for schema information to complete
 ERROR [Thread-28] 2012-09-19 18:57:14,198 AbstractCassandraDaemon.java
 (line 139) Fatal exception in thread Thread[Thread-28,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:140)
 at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:115)
 ...
 ERROR [ReadStage:353] 2012-09-19 18:57:20,453
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[ReadStage:353,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.service.MigrationManager.makeColumns(MigrationManager.java:256)
 at
 org.apache.cassandra.db.DefinitionsUpdateVerbHandler.doVerb(DefinitionsUpdateVerbHandler.java:51)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)


 Originally max heap size was set to 6G. Then we increased heap size
 limit to 8G and it works. But warnings still present

  WARN [ScheduledTasks:1] 2012-09-20 11:39:11,373 GCInspector.java (line
 145) Heap is 0.7760745735786222 full.  You may need to reduce memtable
 and/or cache sizes.  Cassandra will now flush up to the two largest
 memtables to free up memory.  Adjust flush_largest_memtables_at threshold
 in cassandra.yaml if you don't want Cassandra to do this automatically
  INFO [ScheduledTasks:1] 2012-09-20 11:39:11,374 StorageService.java
 (line 2658) Unable to reduce heap usage since there are no dirty column
 families

  It is probably a bug in applying migrations.
  Could anyone explain why cassandra behaves this way? Could you please
  recommend us something to cope with this situation?
 Thank you in advance.

 --
 W/ best regards,
 Sergey B.





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: OOM opening bloom filter

2012-03-13 Thread aaron morton
Thanks for the update. 

How much smaller did the BF get to ? 

A

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 13/03/2012, at 8:24 AM, Mick Semb Wever wrote:

 
 It's my understanding then for this use case that bloom filters are of
 little importance and that i can
 
 
 Ok. To summarise our actions to get us out of this situation, in hope
 that it may help others one day, we did the following actions:
 
 1) upgrade to 1.0.7
 2) set fp_ratio=0.99
 3) set index_interval=4096
 4) restarted the node with Xmx30G
 5) run `nodetool scrub` 
  and monitor total size of bf files
  using `du -hc *-Filter.db | grep total`
 6) restart node with original Xmx setting once total bf size is under
  (scrub was running for 12hrs)
  (remaining bloom filters can be rebuilt later from normal compact)
 
 Hopefully it will also eventuate that this cluster can run with a more
 normal Xmx4G rather than the previous Xmx12G.
 
 (2) and (3) are very much dependent on our set up using hadoop where all
 reads are get_range_slice with 16k rows per request. Both could be tuned
 correctly but they're the numbers that worked first up.
 
 ~mck
 
 -- 
 When there is no enemy within, the enemies outside can't hurt you.
 African proverb 
 
 | http://github.com/finn-no | http://tech.finn.no |



Re: OOM opening bloom filter

2012-03-13 Thread Mick Semb Wever


 How much smaller did the BF get to ? 

After pending compactions completed today (I'm presuming fp_ratio is now
applied to all sstables in the keyspace), the total bloom filter size has gone
from 20G+ down to 1G. This node is now running comfortably on Xmx4G (used heap ~1.5G).


~mck


-- 
A Microsoft Certified System Engineer is to information technology as a
McDonalds Certified Food Specialist is to the culinary arts. Michael
Bacarella 

| http://github.com/finn-no | http://tech.finn.no |




Re: OOM opening bloom filter

2012-03-12 Thread aaron morton
 It's my understanding then for this use case that bloom filters are of
 little importance and that i can
 
Yes.
AFAIK there is only one position seek (that will use the bloom filter)  at the 
start of a get_range_slice request. After that the iterators step over the rows 
in the -Data files. 

For the same reason caches may be considered a little less useful.  

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/03/2012, at 12:44 PM, Mick Semb Wever wrote:

 On Sun, 2012-03-11 at 15:36 -0700, Peter Schuller wrote:
 Are you doing RF=1? 
 
 That is correct. So are your calculations then :-)
 
 
 very small, 1k. Data from this cf is only read via hadoop jobs in batch
 reads of 16k rows at a time.
 [snip]
 It's my understanding then for this use case that bloom filters are of
 little importance and that i can
 
 Depends. I'm not familiar enough with how the hadoop integration works
 so someone else will have to comment, but if your hadoop jobs are just
 performing normal reads of keys via thrift and the keys they are
 grabbing are not in token order, those reads would be effectively
 random and bloom filters should still be highly relevant to the amount
 of I/O operations you need to perform. 
 
 They are thrift get_range_slice reads of 16k rows per request.
 Hadoop reads are based on tokens, but in my use case the keys are also
 ordered and this cluster is using BOP.
 
 ~mck
 
 -- 
 Living on Earth is expensive, but it does include a free trip around
 the sun every year. Unknown 
 
 | http://github.com/finn-no | http://tech.finn.no |



Re: OOM opening bloom filter

2012-03-12 Thread Mick Semb Wever

It's my understanding then for this use case that bloom filters are of
little importance and that i can


Ok. To summarise our actions to get us out of this situation, in hope
that it may help others one day, we did the following actions:

 1) upgrade to 1.0.7
 2) set fp_ratio=0.99
 3) set index_interval=4096
 4) restarted the node with Xmx30G
 5) run `nodetool scrub` 
  and monitor total size of bf files
  using `du -hc *-Filter.db | grep total`
 6) restart node with original Xmx setting once total bf size is under
  (scrub was running for 12hrs)
  (remaining bloom filters can be rebuilt later from normal compact)

Hopefully it will also eventuate that this cluster can run with a more
normal Xmx4G rather than the previous Xmx12G.

(2) and (3) are very much dependent on our set up using hadoop where all
reads are get_range_slice with 16k rows per request. Both could be tuned
correctly but they're the numbers that worked first up.
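
A rough sketch of the monitoring in step 5, assuming a default-style data
directory and an illustrative keyspace name (adjust the path to your layout):

  # re-run periodically; the total should shrink as scrub rewrites sstables
  watch -n 600 'du -hc /var/lib/cassandra/data/MyKeyspace/*-Filter.db | grep total'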

~mck

-- 
When there is no enemy within, the enemies outside can't hurt you.
African proverb 

| http://github.com/finn-no | http://tech.finn.no |




Re: OOM opening bloom filter

2012-03-11 Thread Peter Schuller
 How did this this bloom filter get too big?

Bloom filters grow with the amount of row keys you have. It is natural
that they grow bigger over time. The question is whether there is
something wrong with this node (for example, lots of sstables and
disk space used due to compaction not running, etc) or whether your
cluster is simply increasing its use of row keys over time. You'd
want graphs to be able to see the trends. If you don't, I'd start by
comparing this node with other nodes in the cluster and figure out
whether there is a very significant difference or not.

In any case, a bigger heap will allow you to start up again. But you
should definitely make sure you know what's going on (natural growth
of data vs. some problem) if you want to avoid problems in the future.

If it is legitimate use of memory, you *may*, depending on your
workload, want to adjust target bloom filter false positive rates:

   https://issues.apache.org/jira/browse/CASSANDRA-3497

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: OOM opening bloom filter

2012-03-11 Thread Mick Semb Wever
On Sun, 2012-03-11 at 15:06 -0700, Peter Schuller wrote:
 If it is legitimate use of memory, you *may*, depending on your
 workload, want to adjust target bloom filter false positive rates:
 
https://issues.apache.org/jira/browse/CASSANDRA-3497 

This particular cf has up to ~10 billion rows over 3 nodes. Each row is
very small, 1k. Data from this cf is only read via hadoop jobs in batch
reads of 16k rows at a time. 

*-Data.db files are typically ~50G, and *-Filter.db files typically 2G
although some are 7Gb.
At the moment there are many pending compactions, but i can't do any
because the node crashes at startup.

It's my understanding then for this use case that bloom filters are of
little importance and that i can 
 - upgrade to 1.0.7
 - set fp_ratio=0.99
 - set index_interval=1024

This should alleviate much of the memory problems.
Is this correct?

~mck

-- 
It seems that perfection is reached not when there is nothing left to
add, but when there is nothing left to take away Antoine de Saint
Exupéry (William of Ockham) 

| http://github.com/finn-no | http://tech.finn.no |





Re: OOM opening bloom filter

2012-03-11 Thread Mick Semb Wever
On Sun, 2012-03-11 at 15:36 -0700, Peter Schuller wrote:
 Are you doing RF=1? 

That is correct. So are your calculations then :-)


  very small, 1k. Data from this cf is only read via hadoop jobs in batch
  reads of 16k rows at a time.
 [snip]
  It's my understanding then for this use case that bloom filters are of
  little importance and that i can
 
 Depends. I'm not familiar enough with how the hadoop integration works
 so someone else will have to comment, but if your hadoop jobs are just
 performing normal reads of keys via thrift and the keys they are
 grabbing are not in token order, those reads would be effectively
 random and bloom filters should still be highly relevant to the amount
 of I/O operations you need to perform. 

They are thrift get_range_slice reads of 16k rows per request.
Hadoop reads are based on tokens, but in my use case the keys are also
ordered and this cluster is using BOP.

~mck

-- 
Living on Earth is expensive, but it does include a free trip around
the sun every year. Unknown 

| http://github.com/finn-no | http://tech.finn.no |




Re: OOM

2011-11-02 Thread Ben Coverston
Smells like ulimit. Have you been able to reproduce this with the C*
process running as root?

On Wed, Nov 2, 2011 at 8:12 AM, A J s5a...@gmail.com wrote:

 java.lang.OutOfMemoryError: unable to create new native thread
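
A hedged way to compare the per-user thread/process limit with the thread count
of the running Cassandra JVM (the process-matching pattern is an assumption and
may need adjusting):

  ulimit -u                                      # max user processes; threads count against this on Linux
  ps -o nlwp= -p "$(pgrep -f CassandraDaemon)"   # current thread count of the Cassandra process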


Re: OOM on CompressionMetadata.readChunkOffsets(..)

2011-10-31 Thread Mick Semb Wever
On Mon, 2011-10-31 at 09:07 +0100, Mick Semb Wever wrote:
 The read pattern of these rows is always in bulk so the chunk_length
 could have been much higher so to reduce memory usage (my largest
 sstable is 61G). 

Isn't CompressionMetadata.readChunkOffsets(..) rather dangerous here?

Given a 60G sstable, even with 64kb chunk_length, to read just that one
sstable requires close to 8G free heap memory...

Especially when the default for cassandra is 4G heap in total.

~mck

-- 
Anyone who has attended a computer conference in a fancy hotel can tell
you that a sentence like You're one of those computer people, aren't
you? is roughly equivalent to Look, another amazingly mobile form of
slime mold! in the mouth of a hotel cocktail waitress. Elizabeth
Zwicky 

| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no   | Java XSS Filter |




Re: OOM on CompressionMetadata.readChunkOffsets(..)

2011-10-31 Thread Mick Semb Wever
On Mon, 2011-10-31 at 13:05 +0100, Mick Semb Wever wrote:
 Given a 60G sstable, even with 64kb chunk_length, to read just that one
 sstable requires close to 8G free heap memory... 

Arg, that calculation was a little off...
 (a long isn't exactly 8K...)

But you get my concern...

~mck

-- 
When you say: I wrote a program that crashed Windows, people just
stare at you blankly and say: Hey, I got those with the system -- for
free. Linus Torvalds 

| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no   | Java XSS Filter |




Re: OOM on CompressionMetadata.readChunkOffsets(..)

2011-10-31 Thread Sylvain Lebresne
On Mon, Oct 31, 2011 at 1:10 PM, Mick Semb Wever m...@apache.org wrote:
 On Mon, 2011-10-31 at 13:05 +0100, Mick Semb Wever wrote:
 Given a 60G sstable, even with 64kb chunk_length, to read just that one
 sstable requires close to 8G free heap memory...

 Arg, that calculation was a little off...
  (a long isn't exactly 8K...)

 But you get my concern...

Well, with a long being only 8 bytes, that's 8MB of free heap memory. While not
negligible, that's not completely crazy to me.

No, the problem is that we create those 8MB for each read, which *is* crazy
(the fact that we allocate those 8MB in one block is not very nice for
the GC either, but that's another problem).
Anyway, that's really a bug and I've created CASSANDRA-3427 to fix.
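
A back-of-envelope check of the corrected figure, using the numbers from the
thread (60G sstable, 64kb chunk_length, one 8-byte long per chunk offset):

  # chunks = 60 GiB / 64 KiB, times 8 bytes per offset
  echo $(( (60 * 1024 * 1024 * 1024 / (64 * 1024)) * 8 ))   # 7864320 bytes, i.e. roughly 7.5 MB per range read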

--
Sylvain


 ~mck

 --
 When you say: I wrote a program that crashed Windows, people just
 stare at you blankly and say: Hey, I got those with the system -- for
 free. Linus Torvalds

 | http://semb.wever.org | http://sesat.no |
 | http://tech.finn.no   | Java XSS Filter |



Re: OOM on CompressionMetadata.readChunkOffsets(..)

2011-10-31 Thread Sylvain Lebresne
On Mon, Oct 31, 2011 at 2:58 PM, Sylvain Lebresne sylv...@datastax.com wrote:
 On Mon, Oct 31, 2011 at 1:10 PM, Mick Semb Wever m...@apache.org wrote:
 On Mon, 2011-10-31 at 13:05 +0100, Mick Semb Wever wrote:
 Given a 60G sstable, even with 64kb chunk_length, to read just that one
 sstable requires close to 8G free heap memory...

 Arg, that calculation was a little off...
  (a long isn't exactly 8K...)

 But you get my concern...

 Well, with a long being only 8 bytes, that's 8MB of free heap memory. Without
 being negligible, that's not completely crazy to me.

 No, the problem is that we create those 8MB for each reads, which *is* crazy
 (the fact that we allocate those 8MB in one block is not very nice for
 the GC either
 but that's another problem).
 Anyway, that's really a bug and I've created CASSANDRA-3427 to fix.

Note that it's only a problem for range queries.

--
Sylvain


 --
 Sylvain


 ~mck

 --
 When you say: I wrote a program that crashed Windows, people just
 stare at you blankly and say: Hey, I got those with the system -- for
 free. Linus Torvalds

 | http://semb.wever.org | http://sesat.no |
 | http://tech.finn.no   | Java XSS Filter |




Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Sasha Dolgy
We had a similar problem last month and found that the OS eventually
killed the Cassandra process on each of our nodes ... I've
upgraded to 0.8.0 from 0.7.6-2 and have not had the problem since, but
I do see consumption levels rising consistently from one day to the
next on each node ..

On Wed, Jun 1, 2011 at 2:30 PM, Sasha Dolgy sdo...@gmail.com wrote:
 is there a specific string I should be looking for in the logs that
 isn't super obvious to me at the moment...

 On Tue, May 31, 2011 at 8:21 PM, Jonathan Ellis jbel...@gmail.com wrote:
 The place to start is with the statistics Cassandra logs after each GC.

look for GCInspector

I found this in the logs on all my servers but never did much after that

On Wed, Jun 22, 2011 at 2:33 PM, William Oberman
ober...@civicscience.com wrote:
 I woke up this morning to all 4 of 4 of my cassandra instances reporting
 they were down in my cluster.  I quickly started them all, and everything
 seems fine.  I'm doing a postmortem now, but it appears they all OOM'd at
 roughly the same time, which was not reported in any cassandra log, but I
 discovered something in /var/log/kern that showed java died of oom(*).  In
 amazon, I'm using large instances for cassandra, and they have no swap (as
 recommended), so I have ~8GB of ram.  Should I use a different max mem
 setting?  I'm using a stock rpm from riptano/datastax.  If I run ps -aux I
 get:
 /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
 -Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
 -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
 -Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X
 -Dcom.sun.management.jmxremote.port=8080
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0
 -Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties
 -Dlog4j.defaultInitOverride=true
 -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp
 :/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar
 org.apache.cassandra.thrift.CassandraDaemon
 (*) Also, why would they all OOM so close to each other?  Bad luck?  Or once
 the first node went down, is there an increased chance of the rest?
 I'm still on 0.7.4, when I released cassandra to production that was the
 latest release.  In addition to (or instead of?) fixing memory settings, I'm
 guessing I should upgrade.
 will


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread William Oberman
Well, I managed to run 50 days before an OOM, so any changes I make will
take a while to test ;-)  I've seen the GCInspector log lines appear
periodically in my logs, but I didn't see a correlation with the crash.

I'll read the instructions on how to properly do a rolling upgrade today,
practice on test, and try that on production first.

will

On Wed, Jun 22, 2011 at 8:41 AM, Sasha Dolgy sdo...@gmail.com wrote:

 We had a similar problem a last month and found that the OS eventually
 in the end killed the Cassandra process on each of our nodes ... I've
 upgraded to 0.8.0 from 0.7.6-2 and have not had the problem since, but
 i do see consumption levels rising consistently from one day to the
 next on each node ..

 On Wed, Jun 1, 2011 at 2:30 PM, Sasha Dolgy sdo...@gmail.com wrote:
  is there a specific string I should be looking for in the logs that
  isn't super obvious to me at the moment...
 
  On Tue, May 31, 2011 at 8:21 PM, Jonathan Ellis jbel...@gmail.com
 wrote:
  The place to start is with the statistics Cassandra logs after each GC.

 look for GCInspector

 I found this in the logs on all my servers but never did much after
 that

 On Wed, Jun 22, 2011 at 2:33 PM, William Oberman
 ober...@civicscience.com wrote:
  I woke up this morning to all 4 of 4 of my cassandra instances reporting
  they were down in my cluster.  I quickly started them all, and everything
  seems fine.  I'm doing a postmortem now, but it appears they all OOM'd at
  roughly the same time, which was not reported in any cassandra log, but I
  discovered something in /var/log/kern that showed java died of oom(*).
  In
  amazon, I'm using large instances for cassandra, and they have no swap
 (as
  recommended), so I have ~8GB of ram.  Should I use a different max mem
  setting?  I'm using a stock rpm from riptano/datastax.  If I run ps
 -aux I
  get:
  /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
  -Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
  -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
  -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
  -Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X
  -Dcom.sun.management.jmxremote.port=8080
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0
  -Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties
  -Dlog4j.defaultInitOverride=true
  -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp
 
 :/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar
  org.apache.cassandra.thrift.CassandraDaemon
  (*) Also, why would they all OOM so close to each other?  Bad luck?  Or
 once
  the first node went down, is there an increased chance of the rest?
  I'm still on 0.7.4, when I released cassandra to production that was the
  latest release.  In addition to (or instead of?) fixing memory settings,
 I'm
  guessing I should upgrade.
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Sasha Dolgy
Yes ... this is because it was the OS that killed the process, and
wasn't related to Cassandra crashing.  Reviewing our monitoring, we
saw that memory utilization was pegged at 100% for days and days
before it was finally killed because 'apt' was fighting for resources.
At least, that's as far as I got in my investigation before giving up,
moving to 0.8.0 and implementing a 24hr nodetool repair on each node via
cronjob ... so far, no problems.

On Wed, Jun 22, 2011 at 2:49 PM, William Oberman
ober...@civicscience.com wrote:
 Well, I managed to run 50 days before an OOM, so any changes I make will
 take a while to test ;-)  I've seen the GCInspector log lines appear
 periodically in my logs, but I didn't see a correlation with the crash.
 I'll read the instructions on how to properly do a rolling upgrade today,
 practice on test, and try that on production first.
 will


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread William Oberman
I was wondering about that; I figured the /var/log/kern entries indicated the OS
was killing java (versus an internal OOM).
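
A hedged way to confirm kernel OOM-killer activity, as opposed to a JVM-level
OOM (log locations vary by distro):

  dmesg | grep -iE 'killed process|out of memory' | tail
  grep -iE 'killed process|out of memory' /var/log/kern* /var/log/messages 2>/dev/null | tail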

The nodetool repair is interesting.  My application never deletes, so I
didn't bother running it.  But, if that helps prevent OOMs as well, I'll add
it to the crontab

(plan A is still upgrading to 0.8.0).

will

On Wed, Jun 22, 2011 at 8:53 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Yes ... this is because it was the OS that killed the process, and
 wasn't related to Cassandra crashing.  Reviewing our monitoring, we
 saw that memory utilization was pegged at 100% for days and days
 before it was finally killed because 'apt' was fighting for resource.
 At least, that's as far as I got in my investigation before giving up,
 moving to 0.8.0 and implementing 24hr nodetool repair on each node via
 cronjob ... so far, no problems.

 On Wed, Jun 22, 2011 at 2:49 PM, William Oberman
 ober...@civicscience.com wrote:
  Well, I managed to run 50 days before an OOM, so any changes I make will
  take a while to test ;-)  I've seen the GCInspector log lines appear
  periodically in my logs, but I didn't see a correlation with the crash.
  I'll read the instructions on how to properly do a rolling upgrade today,
  practice on test, and try that on production first.
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread William Oberman
The command line is posted above; I assume those are the defaults (I didn't touch
anything).  The machines basically just run cassandra (and standard CentOS 5
background stuff).

will

On Wed, Jun 22, 2011 at 9:49 AM, Jake Luciani jak...@gmail.com wrote:

 Are you running with the default heap settings? what else is running on the
 boxes?



 On Wed, Jun 22, 2011 at 9:06 AM, William Oberman ober...@civicscience.com
  wrote:

 I was wondering/I figured that /var/log/kern indicated the OS was killing
 java (versus an internal OOM).

 The nodetool repair is interesting.  My application never deletes, so I
 didn't bother running it.  But, if that helps prevent OOMs as well, I'll add
 it to the crontab

 (plan A is still upgrading to 0.8.0).

 will


 On Wed, Jun 22, 2011 at 8:53 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Yes ... this is because it was the OS that killed the process, and
 wasn't related to Cassandra crashing.  Reviewing our monitoring, we
 saw that memory utilization was pegged at 100% for days and days
 before it was finally killed because 'apt' was fighting for resource.
 At least, that's as far as I got in my investigation before giving up,
 moving to 0.8.0 and implementing 24hr nodetool repair on each node via
 cronjob ... so far, no problems.

 On Wed, Jun 22, 2011 at 2:49 PM, William Oberman
 ober...@civicscience.com wrote:
  Well, I managed to run 50 days before an OOM, so any changes I make
 will
  take a while to test ;-)  I've seen the GCInspector log lines appear
  periodically in my logs, but I didn't see a correlation with the crash.
  I'll read the instructions on how to properly do a rolling upgrade
 today,
  practice on test, and try that on production first.
  will




 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




 --
 http://twitter.com/tjake




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Chris Burroughs
On 06/22/2011 08:53 AM, Sasha Dolgy wrote:
 Yes ... this is because it was the OS that killed the process, and
 wasn't related to Cassandra crashing.  Reviewing our monitoring, we
 saw that memory utilization was pegged at 100% for days and days
 before it was finally killed because 'apt' was fighting for resource.
 At least, that's as far as I got in my investigation before giving up,
 moving to 0.8.0 and implementing 24hr nodetool repair on each node via
 cronjob ... so far, no problems.

In `free` terms, by pegged do you mean that free Mem was 0, or -/+
buffers/cache as 0?
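
For anyone unsure of the distinction, the old procps `free` layout shows both
numbers on adjacent lines; a quick check:

  free -m   # "Mem:" free near 0 is normal (page cache); the "-/+ buffers/cache" row shows memory actually available to applications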


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Sasha Dolgy
http://www.twitpic.com/5fdabn
http://www.twitpic.com/5fdbdg

i do love a good graph.  two of the weekly memory utilization graphs
for 2 of the 4 servers from this ring... week 21 was a nice week ...
the week before 0.8.0 went out proper.  since then, bumped up to 0.8
and have seen a steady increase in the memory consumption (used) but
have not seen the swap do what it did ...and the buffered/cached seems
much better

-sd

On Thu, Jun 23, 2011 at 12:09 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:

 In `free` terms, by pegged do you mean that free Mem was 0, or -/+
 buffers/cache as 0?


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Chris Burroughs
Do all of the reductions in Used on that graph correspond to node restarts?

My Zabbix for reference: http://img194.imageshack.us/img194/383/2weekmem.png


On 06/22/2011 06:35 PM, Sasha Dolgy wrote:
 http://www.twitpic.com/5fdabn
 http://www.twitpic.com/5fdbdg
 
 i do love a good graph.  two of the weekly memory utilization graphs
 for 2 of the 4 servers from this ring... week 21 was a nice week ...
 the week before 0.8.0 went out proper.  since then, bumped up to 0.8
 and have seen a steady increase in the memory consumption (used) but
 have not seen the swap do what it did ...and the buffered/cached seems
 much better
 
 -sd
 
 On Thu, Jun 23, 2011 at 12:09 AM, Chris Burroughs
 chris.burrou...@gmail.com wrote:

 In `free` terms, by pegged do you mean that free Mem was 0, or -/+
 buffers/cache as 0?



Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Sasha Dolgy
yes.  each one corresponds to taking a node down for various
reasons.  i think more people should show their graphs.  it's great.
hoping Oberman has some ... so we can see what his look like.

On Thu, Jun 23, 2011 at 12:40 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:
 Do all of the reductions in Used on that graph correspond to node restarts?

 My Zabbix for reference: http://img194.imageshack.us/img194/383/2weekmem.png


 On 06/22/2011 06:35 PM, Sasha Dolgy wrote:
 http://www.twitpic.com/5fdabn
 http://www.twitpic.com/5fdbdg

 i do love a good graph.  two of the weekly memory utilization graphs
 for 2 of the 4 servers from this ring... week 21 was a nice week ...
 the week before 0.8.0 went out proper.  since then, bumped up to 0.8
 and have seen a steady increase in the memory consumption (used) but
 have not seen the swap do what it did ...and the buffered/cached seems
 much better

 -sd

 On Thu, Jun 23, 2011 at 12:09 AM, Chris Burroughs
 chris.burrou...@gmail.com wrote:

 In `free` terms, by pegged do you mean that free Mem was 0, or -/+
 buffers/cache as 0?





-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: OOM during restart

2011-06-21 Thread aaron morton
AFAIK the node will not announce itself in the ring until the log replay is 
complete, so it will not get the schema update until after log replay. If 
possible i'd avoid making the schema change until you have solved this problem.

My theory on OOM during log replay is that the high speed inserts are a good 
way of finding out if the maximum memory required by the schema is too big to 
fit in the JVM. How big is the max JVM Heap Size and do you have a lot of CFs?

The simple solution is to either (temporarily) increase the JVM Heap Size or 
move the log files so that the server can process only one at a time. The JVM 
option D.cassandra_ring=false will stop the node from joining the cluster and 
stop other nodes sending requests to it until you have sorted it out. 

Hope that helps. 
  
 
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 21 Jun 2011, at 10:24, Gabriel Ki wrote:

 Hi,
 
 Cassandra: 7.6-2
 I was restarting a node and ran into OOM while replaying the commit log.  I 
 am not able to bring the node up again.
 
 DEBUG 15:11:43,501 forceFlush requested but everything is clean  
   For this I don't know what to do.
 java.lang.OutOfMemoryError: Java heap space
 at 
 org.apache.cassandra.io.util.BufferedRandomAccessFile.init(BufferedRandomAccessFile.java:123)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.init(SSTableWriter.java:395)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.init(SSTableWriter.java:76)
 at 
 org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2238)
 at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:166)
 at org.apache.cassandra.db.Memtable.access$000(Memtable.java:49)
 at org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:189)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 
 Any help will be appreciated.   
 
 If I update the schema while a node is down, the new schema is loaded before 
 the flushing when the node is brought up again, correct?  
 
 Thanks,
 -gabe



Re: OOM during restart

2011-06-21 Thread Dominic Williams
Hi gabe,

What you need to do is the following:

1. Adjust cassandra.yaml so when this node is starting up it is not
contacted by other nodes e.g. set thrift to 9061 and storage to 7001

2. Copy your commit logs to tmp sub-folder e.g. commitLog/tmp

3. Copy a small number of commit logs back into main commit log folder (be
careful to copy the id.log and id.log.header file together)

4. Start up the node. When it has successfully started up, and therefore you
know it has processed the commit logs, go back to step 3 and repeat

5. When you have no more commit logs remaining in tmp, you can revert
cassandra.yaml and restart.. your node should be up again
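
A rough shell sketch of steps 2-3 above, assuming the default commit log
directory and an illustrative segment name (copy each .log together with its
matching .log.header):

  cd /var/lib/cassandra/commitlog
  mkdir -p tmp && mv CommitLog-*.log CommitLog-*.log.header tmp/
  # bring one segment back, start the node, wait for replay to finish, then repeat
  cp tmp/CommitLog-1308609599036.log tmp/CommitLog-1308609599036.log.header .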

You might want to read
http://ria101.wordpress.com/2011/02/08/cassandra-the-importance-of-system-clocks-avoiding-oom-and-how-to-escape-oom-meltdown/

With Version 0.8 you can set a global memory threshold for the memtables so
this kind of problem should become greatly reduced

Best, Dominic

On 20 June 2011 23:24, Gabriel Ki gab...@gmail.com wrote:

 Hi,

 Cassandra: 7.6-2
 I was restarting a node and ran into OOM while replaying the commit log.  I
 am not able to bring the node up again.

 DEBUG 15:11:43,501 forceFlush requested but everything is clean
   For this I don't know what to do.
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.init(BufferedRandomAccessFile.java:123)
 at
 org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.init(SSTableWriter.java:395)
 at
 org.apache.cassandra.io.sstable.SSTableWriter.init(SSTableWriter.java:76)
 at
 org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2238)
 at
 org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:166)
 at org.apache.cassandra.db.Memtable.access$000(Memtable.java:49)
 at org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:189)
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

 Any help will be appreciated.

 If I update the schema while a node is down, the new schema is loaded
 before the flushing when the node is brought up again, correct?

 Thanks,
 -gabe



Re: OOM during restart

2011-06-21 Thread Jonathan Ellis
If you're OOMing on restart you WILL OOM during normal usage given
heavy enough write load.  Definitely adjust memtable thresholds down
or, as Dominic suggests, upgrade to 0.8.

On Tue, Jun 21, 2011 at 12:02 PM, Dominic Williams
dwilli...@system7.co.uk wrote:
 Hi gabe,
 What you need to do is the following:
 1. Adjust cassandra.yaml so when this node is starting up it is not
 contacted by other nodes e.g. set thrift to 9061 and storage to 7001
 2. Copy your commit logs to tmp sub-folder e.g. commitLog/tmp
 3. Copy a small number of commit logs back into main commit log folder (be
 careful to copy the id.log and id.log.header file together)
 4. Start up the node. When it has successfully started up, and therefore you
 know it has processed the commit logs, go back to step 3 and repeat
 5. When you have no more commit logs remaining in tmp, you can revert
 cassandra.yaml and restart.. your node should be up again
 You might want to
 read http://ria101.wordpress.com/2011/02/08/cassandra-the-importance-of-system-clocks-avoiding-oom-and-how-to-escape-oom-meltdown/
 With Version 0.8 you can set a global memory threshold for the memtables so
 this kind of problem should become greatly reduced
 Best, Dominic

 On 20 June 2011 23:24, Gabriel Ki gab...@gmail.com wrote:

 Hi,

 Cassandra: 7.6-2
 I was restarting a node and ran into OOM while replaying the commit log.
 I am not able to bring the node up again.

 DEBUG 15:11:43,501 forceFlush requested but everything is clean
   For this I don't know what to do.
 java.lang.OutOfMemoryError: Java heap space
     at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.init(BufferedRandomAccessFile.java:123)
     at
 org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.init(SSTableWriter.java:395)
     at
 org.apache.cassandra.io.sstable.SSTableWriter.init(SSTableWriter.java:76)
     at
 org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2238)
     at
 org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:166)
     at org.apache.cassandra.db.Memtable.access$000(Memtable.java:49)
     at org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:189)
     at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
     at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)

 Any help will be appreciated.

 If I update the schema while a node is down, the new schema is loaded
 before the flushing when the node is brought up again, correct?

 Thanks,
 -gabe





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: OOM recovering failed node with many CFs

2011-05-26 Thread Jonathan Ellis
Sounds like a legitimate bug, although looking through the code I'm
not sure what would cause a tight retry loop on migration
announce/rectify. Can you create a ticket at
https://issues.apache.org/jira/browse/CASSANDRA ?

As a workaround, I would try manually copying the Migrations and
Schema sstable files from the system keyspace of the live node, then
restart the recovering one.
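
A hedged sketch of that workaround, assuming a default data directory layout,
an illustrative host name, the recovering node stopped, and that every component
file of each sstable (-Data, -Index, -Filter, ...) is copied together:

  rsync -av livenode:/var/lib/cassandra/data/system/Migrations-* /var/lib/cassandra/data/system/
  rsync -av livenode:/var/lib/cassandra/data/system/Schema-*     /var/lib/cassandra/data/system/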

On Thu, May 26, 2011 at 9:27 AM, Flavio Baronti
f.baro...@list-group.com wrote:
 I can't seem to recover a failed node on a database where I did
 many updates to the schema.

 I have a small cluster with 2 nodes, around 1000 CF (I know it's a lot, but
 it can't be changed right now), and ReplicationFactor=2.
 I shut down a node and cleaned its data entirely, then tried to bring it
 back up. The node starts fetching schema updates from the live node, but the
 operation fails halfway with an OOME.
 After some investigation, what I found is that:

 - I have a lot of schema updates (there are 2067 rows in the system.Schema
 CF).
 - The live node loads migrations 1-1000, and sends them to the recovering
 node (Migration.getLocalMigrations())
 - Soon afterwards, the live node checks the schema version on the recovering
 node and finds it has moved by a little - say it has applied the first 3
 migrations. It then loads migrations 3-1003, and sends them to the node.
 - This process is repeated very quickly (sends migrations 6-1006, 9-1009,
 etc).

 Analyzing the memory dump and the logs, it looks like each of these 1000
 migration blocks are composed in a single message and sent to the
 OutboundTcpConnection queue. However, since the schema is big, the messages
 occupy a lot of space, and are built faster than the connection can send
 them. Therefore, they accumulate in OutboundTcpConnection.queue, until
 memory is completely filled.

 Any suggestions? Can I change something to make this work, apart from
 reducing the number of CFs?

 Flavio




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: OOM recovering failed node with many CFs

2011-05-26 Thread Flavio Baronti

I tried the manual copy you suggest, but the SystemTable.checkHealth() function
complains it can't load the system files. Log follows, I will gather some more
info and create a ticket as soon as possible.

 INFO [main] 2011-05-26 18:25:36,147 AbstractCassandraDaemon.java Logging 
initialized
 INFO [main] 2011-05-26 18:25:36,172 AbstractCassandraDaemon.java Heap size: 
4277534720/4277534720
 INFO [main] 2011-05-26 18:25:36,174 CLibrary.java JNA not found. Native 
methods will be disabled.
 INFO [main] 2011-05-26 18:25:36,190 DatabaseDescriptor.java Loading settings from 
file:/C:/Cassandra/conf/hscassandra9170.yaml
 INFO [main] 2011-05-26 18:25:36,344 DatabaseDescriptor.java DiskAccessMode 'auto' determined to be mmap, 
indexAccessMode is mmap

 INFO [main] 2011-05-26 18:25:36,532 SSTableReader.java Opening 
G:\Cassandra\data\system\Schema-f-2746
 INFO [main] 2011-05-26 18:25:36,577 SSTableReader.java Opening 
G:\Cassandra\data\system\Schema-f-2729
 INFO [main] 2011-05-26 18:25:36,590 SSTableReader.java Opening 
G:\Cassandra\data\system\Schema-f-2745
 INFO [main] 2011-05-26 18:25:36,599 SSTableReader.java Opening 
G:\Cassandra\data\system\Migrations-f-2167
 INFO [main] 2011-05-26 18:25:36,600 SSTableReader.java Opening 
G:\Cassandra\data\system\Migrations-f-2131
 INFO [main] 2011-05-26 18:25:36,602 SSTableReader.java Opening 
G:\Cassandra\data\system\Migrations-f-1041
 INFO [main] 2011-05-26 18:25:36,603 SSTableReader.java Opening 
G:\Cassandra\data\system\Migrations-f-1695
ERROR [main] 2011-05-26 18:25:36,634 AbstractCassandraDaemon.java Fatal 
exception during initialization
org.apache.cassandra.config.ConfigurationException: Found system table files, but they couldn't be loaded. Did you 
change the partitioner?

at org.apache.cassandra.db.SystemTable.checkHealth(SystemTable.java:236)
at 
org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:127)
 
at 
org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
at 
org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)


Il 5/26/2011 6:04 PM, Jonathan Ellis ha scritto:

Sounds like a legitimate bug, although looking through the code I'm
not sure what would cause a tight retry loop on migration
announce/rectify. Can you create a ticket at
https://issues.apache.org/jira/browse/CASSANDRA ?

As a workaround, I would try manually copying the Migrations and
Schema sstable files from the system keyspace of the live node, then
restart the recovering one.

On Thu, May 26, 2011 at 9:27 AM, Flavio Baronti
f.baro...@list-group.com  wrote:

I can't seem to be able to recover a failed node on a database where i did
many updates to the schema.

I have a small cluster with 2 nodes, around 1000 CF (I know it's a lot, but
it can't be changed right now), and ReplicationFactor=2.
I shut down a node and cleaned its data entirely, then tried to bring it
back up. The node starts fetching schema updates from the live node, but the
operation fails halfway with an OOME.
After some investigation, what I found is that:

- I have a lot of schema updates (there are 2067 rows in the system.Schema
CF).
- The live node loads migrations 1-1000, and sends them to the recovering
node (Migration.getLocalMigrations())
- Soon afterwards, the live node checks the schema version on the recovering
node and finds it has moved by a little - say it has applied the first 3
migrations. It then loads migrations 3-1003, and sends them to the node.
- This process is repeated very quickly (sends migrations 6-1006, 9-1009,
etc).

Analyzing the memory dump and the logs, it looks like each of these 1000
migration blocks are composed in a single message and sent to the
OutboundTcpConnection queue. However, since the schema is big, the messages
occupy a lot of space, and are built faster than the connection can send
them. Therefore, they accumulate in OutboundTcpConnection.queue, until
memory is completely filled.

Any suggestions? Can I change something to make this work, apart from
reducing the number of CFs?

Flavio









Re: OOM recovering failed node with many CFs

2011-05-26 Thread Jonathan Ellis
We've applied a fix to the 0.7 branch in
https://issues.apache.org/jira/browse/CASSANDRA-2714.  The patch
probably applies to 0.7.6 as well.

On Thu, May 26, 2011 at 11:36 AM, Flavio Baronti
f.baro...@list-group.com wrote:
 I tried the manual copy you suggest, but the SystemTable.checkHealth()
 function
 complains it can't load the system files. Log follows, I will gather some
 more
 info and create a ticket as soon as possible.

  INFO [main] 2011-05-26 18:25:36,147 AbstractCassandraDaemon.java Logging
 initialized
  INFO [main] 2011-05-26 18:25:36,172 AbstractCassandraDaemon.java Heap size:
 4277534720/4277534720
  INFO [main] 2011-05-26 18:25:36,174 CLibrary.java JNA not found. Native
 methods will be disabled.
  INFO [main] 2011-05-26 18:25:36,190 DatabaseDescriptor.java Loading
 settings from file:/C:/Cassandra/conf/hscassandra9170.yaml
  INFO [main] 2011-05-26 18:25:36,344 DatabaseDescriptor.java DiskAccessMode
 'auto' determined to be mmap, indexAccessMode is mmap
  INFO [main] 2011-05-26 18:25:36,532 SSTableReader.java Opening
 G:\Cassandra\data\system\Schema-f-2746
  INFO [main] 2011-05-26 18:25:36,577 SSTableReader.java Opening
 G:\Cassandra\data\system\Schema-f-2729
  INFO [main] 2011-05-26 18:25:36,590 SSTableReader.java Opening
 G:\Cassandra\data\system\Schema-f-2745
  INFO [main] 2011-05-26 18:25:36,599 SSTableReader.java Opening
 G:\Cassandra\data\system\Migrations-f-2167
  INFO [main] 2011-05-26 18:25:36,600 SSTableReader.java Opening
 G:\Cassandra\data\system\Migrations-f-2131
  INFO [main] 2011-05-26 18:25:36,602 SSTableReader.java Opening
 G:\Cassandra\data\system\Migrations-f-1041
  INFO [main] 2011-05-26 18:25:36,603 SSTableReader.java Opening
 G:\Cassandra\data\system\Migrations-f-1695
 ERROR [main] 2011-05-26 18:25:36,634 AbstractCassandraDaemon.java Fatal
 exception during initialization
 org.apache.cassandra.config.ConfigurationException: Found system table
 files, but they couldn't be loaded. Did you change the partitioner?
        at
 org.apache.cassandra.db.SystemTable.checkHealth(SystemTable.java:236)
        at
 org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:127)
        at
 org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
        at
 org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)


 Il 5/26/2011 6:04 PM, Jonathan Ellis ha scritto:

 Sounds like a legitimate bug, although looking through the code I'm
 not sure what would cause a tight retry loop on migration
 announce/rectify. Can you create a ticket at
 https://issues.apache.org/jira/browse/CASSANDRA ?

 As a workaround, I would try manually copying the Migrations and
 Schema sstable files from the system keyspace of the live node, then
 restart the recovering one.

 On Thu, May 26, 2011 at 9:27 AM, Flavio Baronti
 f.baro...@list-group.com  wrote:

 I can't seem to be able to recover a failed node on a database where i
 did
 many updates to the schema.

 I have a small cluster with 2 nodes, around 1000 CF (I know it's a lot,
 but
 it can't be changed right now), and ReplicationFactor=2.
 I shut down a node and cleaned its data entirely, then tried to bring it
 back up. The node starts fetching schema updates from the live node, but
 the
 operation fails halfway with an OOME.
 After some investigation, what I found is that:

 - I have a lot of schema updates (there are 2067 rows in the
 system.Schema
 CF).
 - The live node loads migrations 1-1000, and sends them to the recovering
 node (Migration.getLocalMigrations())
 - Soon afterwards, the live node checks the schema version on the
 recovering
 node and finds it has moved by a little - say it has applied the first 3
 migrations. It then loads migrations 3-1003, and sends them to the node.
 - This process is repeated very quickly (sends migrations 6-1006, 9-1009,
 etc).

 Analyzing the memory dump and the logs, it looks like each of these 1000
 migration blocks are composed in a single message and sent to the
 OutboundTcpConnection queue. However, since the schema is big, the
 messages
 occupy a lot of space, and are built faster than the connection can send
 them. Therefore, they accumulate in OutboundTcpConnection.queue, until
 memory is completely filled.

 Any suggestions? Can I change something to make this work, apart from
 reducing the number of CFs?

 Flavio









-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: OOM on heavy write load

2011-04-28 Thread Thibaut Britz
Could this be related as well to
https://issues.apache.org/jira/browse/CASSANDRA-2463?

Thibaut


On Wed, Apr 27, 2011 at 11:35 PM, Aaron Morton aa...@thelastpickle.comwrote:

 I'm a bit confused by the two different cases you described, so cannot
 comment specifically on your case.

 In general if Cassandra is slowing down take a look at the thread pool
 stats, using nodetool tpstats to see where it is backing up and take at look
 at the logs to check for excessive GC. If node stats shows the read or
 mutation stage backing up, check the iostats.

 Hope that helps.
 Aaron

 On 28/04/2011, at 12:32 AM, Nikolay Kоvshov nkovs...@yandex.ru wrote:

  I have set quite low memory consumption (see my configuration in first
 message) and give Cassandra 2.7 Gb of memory.
  I cache 1M of 64-bytes keys + 64 Mb memtables. I believe overhead can't
 be 500% or so ?
 
  memtable operations in millions = default 0.3
 
  I see now very strange behaviour
 
  If i fill Cassandra with, say, 100 millions of 64B key + 64B data and
 after that I start inserting 64B key + 64 KB data, compaction queue
 immediately grows to hundreds and cassandra extremely slows down (it makes
 aroung 30-50 operations/sec), then starts to give timeout errors.
 
  But if I insert 64B key + 64 KB data from the very beginning, cassandra
 works fine and makes around 300 operations/sec even when the database
 exceeds 2-3 TB
 
  The behaviour is quite complex and i cannot predict the effect of my
 actions. What consequences I will have if I turn off compaction ? Can i read
 somewhere about what is compaction and why it so heavily depends on what and
 in which order i write into cassandra ?
 
  26.04.2011, 00:08, Shu Zhang szh...@mediosystems.com:
  the way I measure actual memtable row sizes is this
 
  write X rows into a cassandra node
  trigger GC
  record heap usage
  trigger compaction and GC
  record heap savings and divide by X for actual cassandra memtable row
 size in memory
 
  Similar process to measure per-key/per-row cache sizes for your data. To
 understand your memtable row overhead size, you can do the above exercise
 with very different data sizes.
 
  Also, you probably know this, but when setting your memory usage ceiling
 or heap size, make sure to leave a few hundred MBs for GC.
  
  From: Shu Zhang [szh...@mediosystems.com]
  Sent: Monday, April 25, 2011 12:55 PM
  To: user@cassandra.apache.org
  Subject: RE: OOM on heavy write load
 
  How large are your rows? binary_memtable_throughput_in_
  mb only tracks size of data, but there is an overhead associated with
 each row on the order of magnitude of a few KBs. If your row data sizes are
 really small then the overhead dominates the memory usage and
 binary_memtable_throughput_in_
  mb end up not limiting your memory usage the way you'd expect. It's a
 good idea to specify memtable_operations_in_millions in that case. If you're
 not sure how big your data is compared to memtable overhead, it's a good
 idea to specify both parameters to effectively put in a memory ceiling no
 matter which dominates your actual memory usage.
 
  It could also be that your key cache is too big, you should measure your
 key sizes and make sure you have enough memory to cache 1m keys (along with
 your memtables). Finally if you have multiple keyspaces (for multiple
 applications) on your cluster, they all share the total available heap, so
 you have to account for that.
 
  There's no measure in cassandra to guard against OOM, you must configure
 nodes such that the max memory usage on each node, that is max size all your
 caches and memtables can potentially grow to, is less than your heap size.
  
  From: Nikolay Kоvshov [nkovs...@yandex.ru]
  Sent: Monday, April 25, 2011 5:21 AM
  To: user@cassandra.apache.org
  Subject: Re: OOM on heavy write load
 
  I assume if I turn off swap it will just die earlier, no ? What is the
 mechanism of dying ?
 
  From the link you provided
 
  # Row cache is too large, or is caching large rows
  my row_cache is 0
 
  # The memtable sizes are too large for the amount of heap allocated to
 the JVM
  Is my memtable size too large ? I have made it less to surely fit the
 magical formula
 
  Trying to analyze heap dumps gives me the following:
 
  In one case diagram has 3 Memtables about 64 Mb each + 72 Mb Thread +
 700 Mb Unreachable objects
 
  suspected threats:
  7 instances of org.apache.cassandra.db.Memtable, loaded by
 sun.misc.Launcher$AppClassLoader @ 0x7f29f4992d68 occupy 456,292,912
 (48.36%) bytes.
  25,211 instances of org.apache.cassandra.io.sstable.SSTableReader,
 loaded by sun.misc.Launcher$AppClassLoader @ 0x7f29f4992d68 occupy
 294,908,984 (31.26%) byte
  72 instances of java.lang.Thread, loaded by system class loader
 occupy 143,632,624 (15.22%) bytes.
 
  In other cases memory analyzer hangs trying to parse 2Gb dump
 
  22.04.2011, 17:26, Jonathan Ellis jbel...@gmail.com

Re: OOM on heavy write load

2011-04-28 Thread Peter Schuller
 Could this be related as well to
 https://issues.apache.org/jira/browse/CASSANDRA-2463?

My gut feel: Maybe, if the slowness/timeouts reported by the OP are
intermixed with periods of activity to indicate compacting full gc.

OP: Check if cassandra is going into 100% (not less, not more) CPU
usage during periods of timeouts. If the huge allocations fail due to
fragmentation and fallback to Full GC that might be an expected
result. Else -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps.
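
A minimal sketch of enabling that GC logging, assuming a package install where
JVM options are appended in conf/cassandra-env.sh (file location and log path
may differ):

  JVM_OPTS="$JVM_OPTS -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
  JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"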

-- 
/ Peter Schuller


Re: OOM on heavy write load

2011-04-28 Thread Peter Schuller
 My gut feel: Maybe, if the slowness/timeouts reported by the OP are
 intermixed with periods of activity to indicate compacting full gc.

But even then, after taking a single full GC the behavior should
disappear since there should be no left-overs from the smaller columns
causing fragmentation issues. *Maybe* after two full GCs tops if the
first happens while there's a mix still active in memtables.

-- 
/ Peter Schuller


Re: OOM on heavy write load

2011-04-27 Thread Aaron Morton
I'm a bit confused by the two different cases you described, so cannot comment 
specifically on your case.

In general if Cassandra is slowing down take a look at the thread pool stats, 
using nodetool tpstats to see where it is backing up and take at look at the 
logs to check for excessive GC. If node stats shows the read or mutation stage 
backing up, check the iostats.
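
A minimal sketch of those two checks (host and sampling interval are assumptions):

  nodetool -h localhost tpstats   # look for growing Pending counts, e.g. in ReadStage / MutationStage
  iostat -x 5                     # if a stage is backing up, check device utilization and await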

Hope that helps.
Aaron

On 28/04/2011, at 12:32 AM, Nikolay Kоvshov nkovs...@yandex.ru wrote:

 I have set quite low memory consumption (see my configuration in first 
 message) and give Cassandra 2.7 Gb of memory. 
 I cache 1M of 64-bytes keys + 64 Mb memtables. I believe overhead can't be 
 500% or so ?
 
 memtable operations in millions = default 0.3 
 
 I see now very strange behaviour
 
 If I fill Cassandra with, say, 100 million 64B keys + 64B data and after 
 that I start inserting 64B keys + 64 KB data, the compaction queue immediately 
 grows to hundreds and cassandra slows down dramatically (it makes around 30-50 
 operations/sec), then starts to give timeout errors. 
 
 But if I insert 64B key + 64 KB data from the very beginning, cassandra works 
 fine and makes around 300 operations/sec even when the database exceeds 2-3 TB
 
 The behaviour is quite complex and I cannot predict the effect of my actions. 
 What consequences will I have if I turn off compaction? Can I read somewhere 
 about what compaction is and why it depends so heavily on what, and in which 
 order, I write into cassandra?
 
 26.04.2011, 00:08, Shu Zhang szh...@mediosystems.com:
 the way I measure actual memtable row sizes is this
 
 write X rows into a cassandra node
 trigger GC
 record heap usage
 trigger compaction and GC
 record heap savings and divide by X for actual cassandra memtable row size 
 in memory
 
 Similar process to measure per-key/per-row cache sizes for your data. To 
 understand your memtable row overhead size, you can do the above exercise 
 with very different data sizes.
 
 Also, you probably know this, but when setting your memory usage ceiling or 
 heap size, make sure to leave a few hundred MBs for GC.
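
 A rough shell sketch of that measurement loop, assuming a JDK (not just a JRE)
 on the node and an illustrative process-matching pattern; jstat column layout
 varies by JVM version:

   PID=$(pgrep -f CassandraDaemon)
   jmap -histo:live "$PID" > /dev/null   # ":live" forces a full GC before the histogram
   jstat -gc "$PID" | tail -1            # sum S0U+S1U+EU+OU (KB) for used heap; record before/after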
 
 From: Shu Zhang [szh...@mediosystems.com]
 Sent: Monday, April 25, 2011 12:55 PM
 To: user@cassandra.apache.org
 Subject: RE: OOM on heavy write load
 
 How large are your rows? binary_memtable_throughput_in_
 mb only tracks size of data, but there is an overhead associated with each 
 row on the order of magnitude of a few KBs. If your row data sizes are 
 really small then the overhead dominates the memory usage and 
 binary_memtable_throughput_in_
 mb end up not limiting your memory usage the way you'd expect. It's a good 
 idea to specify memtable_operations_in_millions in that case. If you're not 
 sure how big your data is compared to memtable overhead, it's a good idea to 
 specify both parameters to effectively put in a memory ceiling no matter 
 which dominates your actual memory usage.
 
 It could also be that your key cache is too big, you should measure your key 
 sizes and make sure you have enough memory to cache 1m keys (along with your 
 memtables). Finally if you have multiple keyspaces (for multiple 
 applications) on your cluster, they all share the total available heap, so 
 you have to account for that.
 
 There's no measure in cassandra to guard against OOM, you must configure 
 nodes such that the max memory usage on each node, that is max size all your 
 caches and memtables can potentially grow to, is less than your heap size.
 
 From: Nikolay Kоvshov [nkovs...@yandex.ru]
 Sent: Monday, April 25, 2011 5:21 AM
 To: user@cassandra.apache.org
 Subject: Re: OOM on heavy write load
 
 I assume if I turn off swap it will just die earlier, no ? What is the 
 mechanism of dying ?
 
 From the link you provided
 
 # Row cache is too large, or is caching large rows
 my row_cache is 0
 
 # The memtable sizes are too large for the amount of heap allocated to the 
 JVM
 Is my memtable size too large ? I have made it less to surely fit the 
 magical formula
 
 Trying to analyze heap dumps gives me the following:
 
 In one case diagram has 3 Memtables about 64 Mb each + 72 Mb Thread + 700 
 Mb Unreachable objects
 
 suspected threats:
 7 instances of org.apache.cassandra.db.Memtable, loaded by 
 sun.misc.Launcher$AppClassLoader @ 0x7f29f4992d68 occupy 456,292,912 
 (48.36%) bytes.
 25,211 instances of org.apache.cassandra.io.sstable.SSTableReader, loaded 
 by sun.misc.Launcher$AppClassLoader @ 0x7f29f4992d68 occupy 294,908,984 
 (31.26%) byte
 72 instances of java.lang.Thread, loaded by system class loader occupy 
 143,632,624 (15.22%) bytes.
 
 In other cases memory analyzer hangs trying to parse 2Gb dump
 
 22.04.2011, 17:26, Jonathan Ellis jbel...@gmail.com;;:
 
   (0) turn off swap
   (1) 
 http://www.datastax.com/docs/0.7/troubleshooting/index#nodes-are-dying-with-oom-errors
 
   On Fri, Apr 22, 2011 at 8:00 AM, Nikolay Kоvshov nkovs

  1   2   >