Re: cassandra-shuffle time to completion and required disk space

2013-05-01 Thread Richard Low
Hi John,

 - Each machine needed enough free disk space to potentially hold the
entire cluster's sstables on disk

I wrote a possible explanation for why Cassandra is trying to use too much
space on your ticket:

https://issues.apache.org/jira/browse/CASSANDRA-5525

If you could provide the information there, we can hopefully fix it.

Richard.


Repair session failed

2013-05-01 Thread Haithem Jarraya
Hi,

I am seeing this error message during repair,

 INFO [AntiEntropyStage:1] 2013-05-01 14:30:54,300 AntiEntropyService.java (line 764) [repair #ed104480-b26a-11e2-af9b-05179fa66b76] mycolumnfamily is fully synced (1 remaining column family to sync for this session)
ERROR [Thread-12725] 2013-05-01 14:30:54,304 StorageService.java (line 2420) Repair session failed:
java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
        at org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:175)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.init(AntiEntropyService.java:621)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.init(AntiEntropyService.java:610)
        at org.apache.cassandra.service.AntiEntropyService.submitRepairSession(AntiEntropyService.java:127)
        at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:2480)
        at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2416)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)


What does "imprecise repair" mean?
Is it because I went over the gc_grace period?
What do you do if you go over that period?
Any hint would be valuable.
Also, I noticed that when I run a repair on a different node, I see a message like this:

[2013-05-01 14:30:54,305] Starting repair command #5, repairing 1120 ranges
for keyspace struqrealtime

I have a couple of questions: why do I have repair command #5?
And why does the number of ranges change from one node to another?


Many Thanks,

H


RE: Repair session failed

2013-05-01 Thread moshe.kranc
Sounds like a job for "nodetool scrub", which rewrites the SSTable rows in the
correct order. After the scrub, nodetool repair should succeed.
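
For reference, the suggested sequence would look roughly like this on the affected node (a sketch only; the keyspace and column family names are placeholders):

# rewrite the sstables of the affected column family in sorted order
nodetool -h localhost scrub mykeyspace mycolumnfamily

# then re-run the repair that previously failed
nodetool -h localhost repair mykeyspace mycolumnfamily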

From: Haithem Jarraya [mailto:haithem.jarr...@struq.com]
Sent: Wednesday, May 01, 2013 5:46 PM
To: user@cassandra.apache.org
Subject: Repair session failed

Hi,

I am seeing this error message during repair,

 INFO [AntiEntropyStage:1] 2013-05-01 14:30:54,300 AntiEntropyService.java (line 764) [repair #ed104480-b26a-11e2-af9b-05179fa66b76] mycolumnfamily is fully synced (1 remaining column family to sync for this session)
ERROR [Thread-12725] 2013-05-01 14:30:54,304 StorageService.java (line 2420) Repair session failed:
java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
        at org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:175)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.init(AntiEntropyService.java:621)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.init(AntiEntropyService.java:610)
        at org.apache.cassandra.service.AntiEntropyService.submitRepairSession(AntiEntropyService.java:127)
        at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:2480)
        at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2416)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)


What does "imprecise repair" mean?
Is it because I went over the gc_grace period?
What do you do if you go over that period?
Any hint would be valuable.
Also, I noticed that when I run a repair on a different node, I see a message like this:

[2013-05-01 14:30:54,305] Starting repair command #5, repairing 1120 ranges for 
keyspace struqrealtime

I have a couple of questions: why do I have repair command #5?
And why does the number of ranges change from one node to another?


Many Thanks,

H



Re: Repair session failed

2013-05-01 Thread Haithem Jarraya
Can I run scrub while the node is in the ring and receiving writes?
Or should I disable thrift first?
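
A minimal sketch of the more cautious route, assuming the standard nodetool disablethrift/enablethrift subcommands; it does not settle whether an online scrub is safe, which is the open question here:

# stop serving client (thrift) traffic on this node
nodetool -h localhost disablethrift

# rewrite the sstables, then re-enable client traffic
nodetool -h localhost scrub mykeyspace mycolumnfamily
nodetool -h localhost enablethrift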


On 1 May 2013 15:52, moshe.kr...@barclays.com wrote:

 Sounds like a job for "nodetool scrub", which rewrites the SSTable rows in
 the correct order. After the scrub, nodetool repair should succeed.

 From: Haithem Jarraya [mailto:haithem.jarr...@struq.com]
 Sent: Wednesday, May 01, 2013 5:46 PM
 To: user@cassandra.apache.org
 Subject: Repair session failed

 Hi,

 I am seeing this error message during repair,

  INFO [AntiEntropyStage:1] 2013-05-01 14:30:54,300 AntiEntropyService.java (line 764) [repair #ed104480-b26a-11e2-af9b-05179fa66b76] mycolumnfamily is fully synced (1 remaining column family to sync for this session)
 ERROR [Thread-12725] 2013-05-01 14:30:54,304 StorageService.java (line 2420) Repair session failed:
 java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
         at org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:175)
         at org.apache.cassandra.service.AntiEntropyService$RepairSession.init(AntiEntropyService.java:621)
         at org.apache.cassandra.service.AntiEntropyService$RepairSession.init(AntiEntropyService.java:610)
         at org.apache.cassandra.service.AntiEntropyService.submitRepairSession(AntiEntropyService.java:127)
         at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:2480)
         at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2416)
         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.lang.Thread.run(Thread.java:662)

 What does "imprecise repair" mean?
 Is it because I went over the gc_grace period?
 What do you do if you go over that period?
 Any hint would be valuable.
 Also, I noticed that when I run a repair on a different node, I see a message like this:

 [2013-05-01 14:30:54,305] Starting repair command #5, repairing 1120 ranges for keyspace struqrealtime

 I have a couple of questions: why do I have repair command #5?
 And why does the number of ranges change from one node to another?

 Many Thanks,

 H




Re: normal thread counts?

2013-05-01 Thread William Oberman
I've done some more digging, and I have more data but no answers (not
knowing the cassandra internals).

Based on Aaron's comment about gossipinfo/thread dump:

- All IPs that gossip knows about have 2 threads in my thread dump (that
seems ok/fine)

- I have an additional set of IPs in my thread dump in the WRITE- state that:
  1.) Used to be part of my cluster, but are not currently
  2.) Had tokens that are NOT part of the cluster anymore

Cassandra is attempting to communicate with these bad IPs once a minute.
 The log for that attempt is at the bottom of this email.  Does this sound
familiar to anyone else?

Log snippet:

/var/log/cassandra/system.log: INFO [GossipStage:1] 2013-05-01 11:05:11,865 Gossiper.java (line 831) InetAddress /10.114.67.189 is now dead.
/var/log/cassandra/system.log: INFO [GossipStage:1] 2013-05-01 11:05:11,866 StorageService.java (line 1303) Removing token 0 for /10.114.67.189
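
For anyone who wants to reproduce this check, a rough sketch of how to count the outbound connection threads per peer and compare them with what gossip knows about (assuming jstack can attach to the Cassandra JVM and that thread names follow the WRITE-/a.b.c.d pattern shown above):

# count OutboundTcpConnection threads per peer IP in a thread dump
jstack <cassandra_pid> | grep -o 'WRITE-/[0-9.]*' | sort | uniq -c | sort -rn

# list the peers the node actually knows about via gossip
nodetool -h localhost gossipinfo | grep '^/'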



On Tue, Apr 30, 2013 at 5:34 PM, aaron morton aa...@thelastpickle.com wrote:

 The issue below could result in abandoned threads under high contention,
 so we'll get that fixed.

 But we are not sure how/why it would be called so many times. If you could
 provide a full list of threads and the output from nodetool gossipinfo that
 would help.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 1/05/2013, at 8:34 AM, aaron morton aa...@thelastpickle.com wrote:

  Many many many of the threads are trying to talk to IPs that aren't in
 the cluster (I assume they are the IP's of dead hosts).

 Are these IP's from before the upgrade ? Are they IP's you expect to see ?

 Cross reference them with the output from nodetool gossipinfo to see why
 the node thinks they should be used.
 Could you provide a list of the thread names ?

 One way to remove those IPs may be to do a rolling restart with
 -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
 cassandra-env.sh
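
A minimal sketch of that change, assuming the stock cassandra-env.sh where JVM_OPTS is assembled near the end of the file (the flag can be removed again once the stale ring state is gone):

# cassandra-env.sh, applied per node as part of the rolling restart
JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"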

 The OutboundTcpConnection threads are created in pairs by the
 OutboundTcpConnectionPool, which is created here:
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
 The threads are created in the OutboundTcpConnectionPool constructor; I'm
 checking to see if this could be the source of the leak.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 1/05/2013, at 2:18 AM, William Oberman ober...@civicscience.com
 wrote:

 I use phpcassa.

 I did a thread dump.  99% of the threads look very similar (I'm using
 1.1.9 in terms of matching source lines).  The thread names are all like
 this: WRITE-/10.x.y.z.  There are a LOT of duplicates (in terms of the
 same IP).  Many many many of the threads are trying to talk to IPs that
 aren't in the cluster (I assume they are the IP's of dead hosts).  The
 stack trace is basically the same for them all, attached at the bottom.

 There is a lot of things I could talk about in terms of my situation, but
 what I think might be pertinent to this thread: I hit a tipping point
 recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
 (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of
 the 9 keep struggling.  I've replaced them many times now, each time using
 this process:
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
 And even this morning the only two nodes with a high number of threads are
 those two (yet again).  And at some point they'll OOM.

Seems like there is something about my cluster (caused by the recent
upgrade?) that causes a thread leak on OutboundTcpConnection. But I don't
know how to escape from the trap.  Any ideas?


 
   stackTrace = [ {
 className = sun.misc.Unsafe;
 fileName = Unsafe.java;
 lineNumber = -2;
  methodName = park;
 nativeMethod = true;
}, {
 className = java.util.concurrent.locks.LockSupport;
 fileName = LockSupport.java;
 lineNumber = 158;
 methodName = park;
 nativeMethod = false;
}, {
 className =
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
 fileName = AbstractQueuedSynchronizer.java;
 lineNumber = 1987;
 methodName = await;
 nativeMethod = false;
}, {
 className = java.util.concurrent.LinkedBlockingQueue;
 fileName = LinkedBlockingQueue.java;
 lineNumber = 399;
 methodName = take;
 nativeMethod = false;
}, {
 className = org.apache.cassandra.net.OutboundTcpConnection;
 fileName = OutboundTcpConnection.java;
 lineNumber = 104;
 methodName = run;
 nativeMethod = false;
} ];
 --




On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

  I used JMX to check current number of threads in a production cassandra
 machine, and it was ~27,000.

 That does not sound too good.

Re: HOW TO SET CONSISTENCY LEVEL FOR

2013-05-01 Thread Tyler Hobbs
sstableloader directly streams sstables to the correct replica nodes.  It
doesn't go through the normal coordinated write process, so consistency
levels do not apply.
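
For reference, a rough sketch of how the loader tends to be invoked in that situation; the directory path and IP below are placeholders, and only the -i (ignore nodes) option already mentioned in this thread is assumed:

# stream the pre-built sstables, skipping a node that is known to be down
sstableloader -i 10.0.0.12 /path/to/generated/mykeyspace/mycolumnfamily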


On Wed, May 1, 2013 at 3:42 AM, Chandana Tummala chandana.tumm...@wipro.com
 wrote:

 Hi All,

 We have a requirement to load approximately 10 million records, each record
 with approximately 100 columns, daily and automatically. We are planning to
 use the Bulk-loader program to convert the data into SSTables and then load
 them using sstableloader.

 Everything is working fine when all nodes are up and running and the
 performance is very good. However, when a node is down, the streaming fails
 and the operation stops. We have to run sstableloader with the -i option to
 exclude the node that is down. I was wondering if we can enforce a
 consistency level of ANY with sstableloader as well.

 I have tried the -i option, but the program runs automatically, so finding
 the failed node and re-running may decrease performance.


 Can anyone suggest the best approach for the bulk loader program so that it
 does not fail even when one node is down?


 Environment:

 Cassandra 1.1.9.1  provided as part of DSE 3.0
 8 Nodes
 Replication Factor – 3
 Consistency Level – ANY

 Regards,
 Praveen







-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: normal thread counts?

2013-05-01 Thread Janne Jalkanen

This sounds very much like 
https://issues.apache.org/jira/browse/CASSANDRA-5175, which was fixed in 1.1.10.

/Janne

On Apr 30, 2013, at 23:34 , aaron morton aa...@thelastpickle.com wrote:

  Many many many of the threads are trying to talk to IPs that aren't in the 
 cluster (I assume they are the IP's of dead hosts). 
 Are these IP's from before the upgrade ? Are they IP's you expect to see ? 
 
 Cross reference them with the output from nodetool gossipinfo to see why the 
 node thinks they should be used. 
 Could you provide a list of the thread names ? 
 
 One way to remove those IPs may be to do a rolling restart with
 -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
 cassandra-env.sh

 The OutboundTcpConnection threads are created in pairs by the
 OutboundTcpConnectionPool, which is created here:
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
 The threads are created in the OutboundTcpConnectionPool constructor; I'm
 checking to see if this could be the source of the leak.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 1/05/2013, at 2:18 AM, William Oberman ober...@civicscience.com wrote:
 
 I use phpcassa.
 
 I did a thread dump.  99% of the threads look very similar (I'm using 1.1.9 
 in terms of matching source lines).  The thread names are all like this: 
 WRITE-/10.x.y.z.  There are a LOT of duplicates (in terms of the same IP). 
  Many many many of the threads are trying to talk to IPs that aren't in the 
 cluster (I assume they are the IP's of dead hosts).  The stack trace is 
 basically the same for them all, attached at the bottom.   
 
 There is a lot of things I could talk about in terms of my situation, but 
 what I think might be pertinent to this thread: I hit a tipping point 
 recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge 
 (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of 
 the 9 keep struggling.  I've replaced them many times now, each time using 
 this process:
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
 And even this morning the only two nodes with a high number of threads are 
 those two (yet again).  And at some point they'll OOM.
 
 Seems like there is something about my cluster (caused by the recent 
 upgrade?) that causes a thread leak on OutboundTcpConnection   But I don't 
 know how to escape from the trap.  Any ideas?
 
 
 
   stackTrace = [ { 
 className = sun.misc.Unsafe;
 fileName = Unsafe.java;
 lineNumber = -2;
 methodName = park;
 nativeMethod = true;
}, { 
 className = java.util.concurrent.locks.LockSupport;
 fileName = LockSupport.java;
 lineNumber = 158;
 methodName = park;
 nativeMethod = false;
}, { 
 className = 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
 fileName = AbstractQueuedSynchronizer.java;
 lineNumber = 1987;
 methodName = await;
 nativeMethod = false;
}, { 
 className = java.util.concurrent.LinkedBlockingQueue;
 fileName = LinkedBlockingQueue.java;
 lineNumber = 399;
 methodName = take;
 nativeMethod = false;
}, { 
 className = org.apache.cassandra.net.OutboundTcpConnection;
 fileName = OutboundTcpConnection.java;
 lineNumber = 104;
 methodName = run;
 nativeMethod = false;
} ];
 --
 
 
 
 
 On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com 
 wrote:
  I used JMX to check current number of threads in a production cassandra 
 machine, and it was ~27,000.
 That does not sound too good. 
 
 My first guess would be lots of client connections. What client are you 
 using, does it do connection pooling ?
 See the comments in cassandra.yaml around rpc_server_type; the default, sync,
 uses one thread per connection, and you may be better off with HSHA. But if
 your app is leaking connections you should probably deal with that first.
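
The setting being referred to is a single line in cassandra.yaml (a sketch; it requires a node restart, and hsha uses a fixed processing thread pool instead of one thread per client connection):

rpc_server_type: hsha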
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 30/04/2013, at 3:07 AM, William Oberman ober...@civicscience.com wrote:
 
 Hi,
 
 I'm having some issues.  I keep getting:
 
 ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java 
 (line 135) Exception in thread Thread[GossipStage:1,5,main]
 java.lang.OutOfMemoryError: unable to create new native thread
 --
 after a day or two of runtime.  I've checked and my system settings seem 
 acceptable:
 memlock=unlimited
 nofiles=10
 nproc=122944
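
A quick way to confirm the limits the running JVM actually inherited (a sketch, assuming a Linux host and that <pid> is the Cassandra process id):

# per-process limits as seen by the running Cassandra process
grep -Ei 'max processes|open files|locked memory' /proc/<pid>/limits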
 
 I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS), 
 and I keep OOM'ing with the above error.
 
 I've found some (what seem to me) to be obscure references to the stack 
 size interacting with # of threads.  If I'm understanding it correctly, to 
 

Re: normal thread counts?

2013-05-01 Thread William Oberman
That has GOT to be it.  1.1.10 upgrade it is...


On Wed, May 1, 2013 at 5:09 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote:


 This sounds very much like
 https://issues.apache.org/jira/browse/CASSANDRA-5175, which was fixed in
 1.1.10.

 /Janne

 On Apr 30, 2013, at 23:34 , aaron morton aa...@thelastpickle.com wrote:

  Many many many of the threads are trying to talk to IPs that aren't in
 the cluster (I assume they are the IP's of dead hosts).

 Are these IP's from before the upgrade ? Are they IP's you expect to see ?

 Cross reference them with the output from nodetool gossipinfo to see why
 the node thinks they should be used.
 Could you provide a list of the thread names ?

 One way to remove those IPs may be to do a rolling restart with
 -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
 cassandra-env.sh

 The OutboundTcpConnection threads are created in pairs by the
 OutboundTcpConnectionPool, which is created here:
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
 The threads are created in the OutboundTcpConnectionPool constructor; I'm
 checking to see if this could be the source of the leak.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 1/05/2013, at 2:18 AM, William Oberman ober...@civicscience.com
 wrote:

 I use phpcassa.

 I did a thread dump.  99% of the threads look very similar (I'm using
 1.1.9 in terms of matching source lines).  The thread names are all like
 this: WRITE-/10.x.y.z.  There are a LOT of duplicates (in terms of the
 same IP).  Many many many of the threads are trying to talk to IPs that
 aren't in the cluster (I assume they are the IP's of dead hosts).  The
 stack trace is basically the same for them all, attached at the bottom.

 There is a lot of things I could talk about in terms of my situation, but
 what I think might be pertinent to this thread: I hit a tipping point
 recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
 (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of
 the 9 keep struggling.  I've replaced them many times now, each time using
 this process:
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
 And even this morning the only two nodes with a high number of threads are
 those two (yet again).  And at some point they'll OOM.

 Seems like there is something about my cluster (caused by the recent
 upgrade?) that causes a thread leak on OutboundTcpConnection   But I don't
 know how to escape from the trap.  Any ideas?


 
   stackTrace = [ {
 className = sun.misc.Unsafe;
 fileName = Unsafe.java;
 lineNumber = -2;
  methodName = park;
 nativeMethod = true;
}, {
 className = java.util.concurrent.locks.LockSupport;
 fileName = LockSupport.java;
 lineNumber = 158;
 methodName = park;
 nativeMethod = false;
}, {
 className =
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
 fileName = AbstractQueuedSynchronizer.java;
 lineNumber = 1987;
 methodName = await;
 nativeMethod = false;
}, {
 className = java.util.concurrent.LinkedBlockingQueue;
 fileName = LinkedBlockingQueue.java;
 lineNumber = 399;
 methodName = take;
 nativeMethod = false;
}, {
 className = org.apache.cassandra.net.OutboundTcpConnection;
 fileName = OutboundTcpConnection.java;
 lineNumber = 104;
 methodName = run;
 nativeMethod = false;
} ];
 --




On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

  I used JMX to check current number of threads in a production cassandra
 machine, and it was ~27,000.

 That does not sound too good.

 My first guess would be lots of client connections. What client are you
 using, does it do connection pooling ?
 See the comments in cassandra.yaml around rpc_server_type; the default, sync,
 uses one thread per connection, and you may be better off with HSHA. But
 if your app is leaking connections you should probably deal with that first.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 30/04/2013, at 3:07 AM, William Oberman ober...@civicscience.com
 wrote:

 Hi,

 I'm having some issues.  I keep getting:
 
 ERROR [GossipStage:1] 2013-04-28 07:48:48,876
 AbstractCassandraDaemon.java (line 135) Exception in thread
 Thread[GossipStage:1,5,main]
 java.lang.OutOfMemoryError: unable to create new native thread
 --
 after a day or two of runtime.  I've checked and my system settings seem
 acceptable:
 memlock=unlimited
 nofiles=10
 nproc=122944

 I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS),
 and I keep OOM'ing with the above error.

 I've found some (what seem to me) to be obscure references to the stack
 size