Re: Re: Re: What is consuming the heap?

2010-07-22 Thread
The version we are using is 0.6.1

2010-07-23 







发件人: 王一锋 
发送时间: 2010-07-23  09:38:15 
收件人: user 
抄送: 
主题: Re: Re: Re: What is consuming the heap? 
 
Yes, we are doing a lot of inserts.

But how can CASSANDRA-1042 cause an OutOfMemory?
And we are using multigetSlice(). We are not doing any get_range_slice() at all.

2010-07-23 







发件人: Jonathan Ellis 
发送时间: 2010-07-21  21:17:21 
收件人: user 
抄送: 
主题: Re: Re: What is consuming the heap? 
On Tue, Jul 20, 2010 at 11:33 PM, Peter Schuller
 wrote:
>>  INFO [GC inspection] 2010-07-21 01:01:49,661 GCInspector.java (line 110) GC 
>> for ConcurrentMarkSweep: 11748 ms, 413673472 reclaimed leaving 9779542600 
>> used; max is 10873667584
>> ERROR [Thread-35] 2010-07-21 01:02:10,941 CassandraDaemon.java (line 78) 
>> Fatal exception in thread Thread[Thread-35,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>>  INFO [GC inspection] 2010-07-21 01:02:10,958 GCInspector.java (line 110) GC 
>> for ConcurrentMarkSweep: 10043 ms, 259576 reclaimed leaving 10172790816 
>> used; max is 10873667584
>
> So that confirms a "legitimate" out-of-memory condition in the sense
> that CMS is reclaiming extremely little and the live set after a
> concurrent mark/sweep is indeed around the 10 gig.
Are you doing a lot of inserts?  You might be hitting
https://issues.apache.org/jira/browse/CASSANDRA-1042
-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Re: Re: What is consuming the heap?

2010-07-22 Thread
Yes, we are doing a lot of inserts.

But how can CASSANDRA-1042 cause an OutOfMemory?
And we are using multigetSlice(). We are not doing any get_range_slice() at all.

2010-07-23 







发件人: Jonathan Ellis 
发送时间: 2010-07-21  21:17:21 
收件人: user 
抄送: 
主题: Re: Re: What is consuming the heap? 
 
On Tue, Jul 20, 2010 at 11:33 PM, Peter Schuller
 wrote:
>>  INFO [GC inspection] 2010-07-21 01:01:49,661 GCInspector.java (line 110) GC 
>> for ConcurrentMarkSweep: 11748 ms, 413673472 reclaimed leaving 9779542600 
>> used; max is 10873667584
>> ERROR [Thread-35] 2010-07-21 01:02:10,941 CassandraDaemon.java (line 78) 
>> Fatal exception in thread Thread[Thread-35,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>>  INFO [GC inspection] 2010-07-21 01:02:10,958 GCInspector.java (line 110) GC 
>> for ConcurrentMarkSweep: 10043 ms, 259576 reclaimed leaving 10172790816 
>> used; max is 10873667584
>
> So that confirms a "legitimate" out-of-memory condition in the sense
> that CMS is reclaiming extremely little and the live set after a
> concurrent mark/sweep is indeed around the 10 gig.
Are you doing a lot of inserts?  You might be hitting
https://issues.apache.org/jira/browse/CASSANDRA-1042
-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Re: Re: What is consuming the heap?

2010-07-22 Thread
Yes, I'm running with defaults settings otherwise.
For cache sizes, I've tried '0' for non-cached, '1' for full cached and a fixed 
value of 50, for KeysCached, RowsCached was using default everytime.
So I don't think the problem is about the cache.
Concurrent read was 32, write was 64
I also tried 320 and 640

The read/write ratio is about 2/1

How much memory will it need to do a compaction?
another 2 nodes went down last night. They were doing a compaction before they 
went down, judging from the timestamp of the *tmp* files in the data folder. 

Stack trace for node 1
 INFO [GC inspection] 2010-07-23 04:13:24,517 GCInspector.java (line 110) GC 
for ConcurrentMarkSweep: 31275 ms, 29578704 reclaimed leaving 10713006792 used; 
max is 10873667584
ERROR [MESSAGE-DESERIALIZER-POOL:1] 2010-07-23 04:14:30,656 
DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap 
space
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.cassandra.net.MessageSerializer.deserialize(Message.java:138)
at 
org.apache.cassandra.net.MessageDeserializationTask.run(MessageDeserializationTask.java:45)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
... 2 more

Stack trace for node 2
 INFO [COMMIT-LOG-WRITER] 2010-07-23 01:41:06,550 CommitLogSegment.java (line 
50) Creating new commitlog segment 
/opt/crawler/cassandra/sysdata/commitlog/CommitLog-1279820466550.log
 INFO [Timer-1] 2010-07-23 01:41:09,027 Gossiper.java (line 179) InetAddress 
/183.62.134.31 is now dead.
 INFO [ROW-MUTATION-STAGE:45] 2010-07-23 01:41:09,279 ColumnFamilyStore.java 
(line 357) source_page has reached its threshold; switching in a fresh Memtable 
at 
CommitLogContext(file='/opt/crawler/cassandra/sysdata/commitlog/CommitLog-1279820466550.log',
 position=9413)
 INFO [ROW-MUTATION-STAGE:45] 2010-07-23 01:41:09,322 ColumnFamilyStore.java 
(line 609) Enqueuing flush of Memtable(source_page)@1343553539
 INFO [FLUSH-WRITER-POOL:1] 2010-07-23 01:41:09,323 Memtable.java (line 148) 
Writing Memtable(source_page)@1343553539
 INFO [GMFD:1] 2010-07-23 01:41:09,349 Gossiper.java (line 568) InetAddress 
/183.62.134.30 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,349 Gossiper.java (line 568) InetAddress 
/183.62.134.31 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,350 Gossiper.java (line 568) InetAddress 
/183.62.134.28 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,350 Gossiper.java (line 568) InetAddress 
/183.62.134.26 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,350 Gossiper.java (line 568) InetAddress 
/183.62.134.27 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,350 Gossiper.java (line 568) InetAddress 
/183.62.134.24 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,350 Gossiper.java (line 568) InetAddress 
/183.62.134.25 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,350 Gossiper.java (line 568) InetAddress 
/183.62.134.22 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,350 Gossiper.java (line 568) InetAddress 
/183.62.134.23 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,351 Gossiper.java (line 568) InetAddress 
/183.62.134.33 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,351 Gossiper.java (line 568) InetAddress 
/183.62.134.32 is now UP
 INFO [GMFD:1] 2010-07-23 01:41:09,351 Gossiper.java (line 568) InetAddress 
/183.62.134.34 is now UP
 INFO [GC inspection] 2010-07-23 01:41:24,192 GCInspector.java (line 110) GC 
for ConcurrentMarkSweep: 12908 ms, 413977296 reclaimed leaving 9524655928 used; 
max is 10873667584
 INFO [Timer-1] 2010-07-23 01:41:50,867 Gossiper.java (line 179) InetAddress 
/183.62.134.34 is now dead.
 INFO [Timer-1] 2010-07-23 01:41:50,871 Gossiper.java (line 179) InetAddress 
/183.62.134.33 is now dead.
 INFO [Timer-1] 2010-07-23 01:41:50,871 Gossiper.java (line 179) InetAddress 
/183.62.134.32 is now dead.
 INFO [Timer-1] 2010-07-23 01:41:50,871 Gossiper.java (line 179) InetAddress 
/183.62.134.31 is now dead.
 INFO [Timer-1] 2010-07-23 01:41:50,872 Gossiper.java (line 179) InetAddress 
/183.62.134.30 is now dead.
 INFO [Timer-1] 2010-07-23 01:41:50,872 Gossiper.java (line 179) InetAddress 
/183.62.134.28 is now dead.
 INFO [Timer-1] 2010-07-23 01:41:50,87

Re: Re: Re: What is consuming the heap?

2010-07-20 Thread

no, I'm using QUORUM for both writes and reads
Replication factor is 3

2010-07-21 







发件人: Dathan Pattishall 
发送时间: 2010-07-21  12:51:32 
收件人: user 
抄送: 
主题: Re: Re: What is consuming the heap? 
 
By off chance on writes are you using ConsistencyLevel::ZERO?





On Tue, Jul 20, 2010 at 9:41 PM, 王一锋  wrote:

So the bloom filters reside in memory completely?

We do have a lot of small values, hundreds of millions of columns in a 
columnfamily.

I count the total size of *-Filter.db files in my keyspace, it's 
436,747,815bytes.

I guess this means it won't consume a major part of 10g heap space


2010-07-21 







发件人: Peter Schuller 
发送时间: 2010-07-20  21:45:08 
收件人: user 
抄送: 
主题: Re: What is consuming the heap? 
> heap size is 10G and the load of data per node was around 300G, 16-core CPU,
Are the 300 GB made up of *really* small values? Per SS table bloom
filters do consume memory, but you'd have to have a *lot* of *really*
small values for a 300 GB database to cause bloom filters to be a
significant part of a 10 GB h eap.
-- 
/ Peter Schuller


Re: Re: What is consuming the heap?

2010-07-20 Thread
So the bloom filters reside in memory completely?

We do have a lot of small values, hundreds of millions of columns in a 
columnfamily.

I count the total size of *-Filter.db files in my keyspace, it's 
436,747,815bytes.

I guess this means it won't consume a major part of 10g heap space


2010-07-21 







发件人: Peter Schuller 
发送时间: 2010-07-20  21:45:08 
收件人: user 
抄送: 
主题: Re: What is consuming the heap? 
 
> heap size is 10G and the load of data per node was around 300G, 16-core CPU,
Are the 300 GB made up of *really* small values? Per SS table bloom
filters do consume memory, but you'd have to have a *lot* of *really*
small values for a 300 GB database to cause bloom filters to be a
significant part of a 10 GB h eap.
-- 
/ Peter Schuller


Re: Re: What is consuming the heap?

2010-07-20 Thread
I can only find these in the system.log

 INFO [GC inspection] 2010-07-21 01:01:49,661 GCInspector.java (line 110) GC 
for ConcurrentMarkSweep: 11748 ms, 413673472 reclaimed leaving 9779542600 used; 
max is 10873667584
ERROR [Thread-35] 2010-07-21 01:02:10,941 CassandraDaemon.java (line 78) Fatal 
exception in thread Thread[Thread-35,5,main]
java.lang.OutOfMemoryError: Java heap space
 INFO [GC inspection] 2010-07-21 01:02:10,958 GCInspector.java (line 110) GC 
for ConcurrentMarkSweep: 10043 ms, 259576 reclaimed leaving 10172790816 used; 
max is 10873667584




2010-07-21 







发件人: Jonathan Ellis 
发送时间: 2010-07-20  19:26:11 
收件人: user 
抄送: 
主题: Re: What is consuming the heap? 
 
you should post the full stack trace.
2010/7/20 王一锋 :
> In my cluster, I have set both KeysCached and RowsCached of my column family
> on all nodes to "0",
> but it still happened that a few nodes crashed because of OutOfMemory
> (from the gc.log, a full gc wasn't able to free up any memory space),
>
> what else can be consuming the heap?
>
> heap size is 10G and the load of data per node was around 300G, 16-core CPU,
> 1T HDD
>
> 2010-07-20
> 
-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: SV: What is consuming the heap?

2010-07-19 Thread

No, I don't think so. Because I'm not using supercolumn and size of a column 
will not exceed 1M

2010-07-20 







发件人: Thorvaldsson Justus 
发送时间: 2010-07-20  14:52:22 
收件人: 'user@cassandra.apache.org' 
抄送: 
主题: SV: What is consuming the heap? 
 
Supercolumn/column must fit into node memory
It could be?
/Justus
Från: 王一锋 [mailto:wangyif...@aspire-tech.com] 
Skickat: den 20 juli 2010 08:48
Till: user
Ämne: What is consuming the heap?
 
In my cluster, I have set both KeysCached and RowsCached of my column family on 
all nodes to "0", 
but it still happened that a few nodes crashed because of OutOfMemory
(from the gc.log, a full gc wasn't able to free up any memory space), 
 
what else can be consuming the heap?
 
heap size is 10G and the load of data per node was around 300G, 16-core CPU, 1T 
HDD
 
2010-07-20 


What is consuming the heap?

2010-07-19 Thread
In my cluster, I have set both KeysCached and RowsCached of my column family on 
all nodes to "0", 
but it still happened that a few nodes crashed because of OutOfMemory
(from the gc.log, a full gc wasn't able to free up any memory space), 

what else can be consuming the heap?

heap size is 10G and the load of data per node was around 300G, 16-core CPU, 1T 
HDD

2010-07-20 


Frequent crashes

2010-07-14 Thread
Hi, 

Has anyboy done any memory usage analysis for cassandra?

How much memory does cassandra need to manager 300G of data load? How much 
extra memory will be needed when doing compaction?

Regarding mmap, memory usage will be determined by the OS so it has nothing to 
do with the heap size of JVM, am I right?

I have a cassandra cluster of 13 nodes, each with 200~300g data.
JVM settings
JVM_OPTS=" \
-ea \
-Xms6G \
-Xmx6G \
-XX:TargetSurvivorRatio=90 \
-XX:+AggressiveOpts \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:SurvivorRatio=128 \
-XX:MaxTenuringThreshold=0 \
-XX:+PrintGC -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
-Dcom.sun.management.jmxremote.port=4993 \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false"
KeysCache settings for 3 column families are 5,000,000  1,000,000  1,000,000

some nodes run for 1 to 2 days, and then gets very slow, due to bad gc 
performance, then crashed. This happed quite a lot, almost every day. 
Here is a fragment of the gc.log

 (concurrent mode failure): 6014591K->6014591K(6014592K), 25.4846400 secs] 
6289343K->6282274K(6289344K), [CMS Perm : 17290K->17287K(28988K)], 25.4848970 
secs] [Times: user=37.76 sys=0.12, real=25.49 secs] 
69695.771: [Full GC 69695.771: [CMS: 6014591K->6014591K(6014592K), 21.0911470 
secs] 6289343K->6282177K(6289344K), [CMS Perm : 17287K->17287K(28988K)], 
21.0913910 secs] [Times: user=21.01 sys=0.12, real=21.09 secs] 
69716.902: [GC [1 CMS-initial-mark: 6014591K(6014592K)] 6287620K(6289344K), 
0.2759980 secs] [Times: user=0.28 sys=0.00, real=0.28 secs] 
69717.178: [CMS-concurrent-mark-start]
69717.203: [Full GC 69717.203: [CMS69721.345: [CMS-concurrent-mark: 4.152/4.167 
secs] [Times: user=16.64 sys=0.01, real=4.17 secs] 
 (concurrent mode failure): 6014592K->6014591K(6014592K), 25.3649330 secs] 
6289343K->6282200K(6289344K), [CMS Perm : 17287K->17287K(28988K)], 25.3651670 
secs] [Times: user=37.67 sys=0.13, real=25.37 secs] 
69742.598: [Full GC 69742.598: [CMS: 6014591K->6014592K(6014592K), 21.0942430 
secs] 6289343K->6282398K(6289344K), [CMS Perm : 17290K->17287K(28988K)], 
21.0944950 secs] [Times: user=21.00 sys=0.12, real=21.10 secs] 
69763.721: [Full GC 69763.721: [CMS: 6014592K->6014591K(6014592K), 21.0978230 
secs] 6289343K->6282553K(6289344K), [CMS Perm : 17290K->17287K(28988K)], 
21.0980600 secs] [Times: user=20.99 sys=0.12, real=21.09 secs] 
69784.830: [GC [1 CMS-initial-mark: 6014591K(6014592K)] 6287995K(6289344K), 
0.2765360 secs] [Times: user=0.28 sys=0.00, real=0.28 secs] 
69785.107: [CMS-concurrent-mark-start]
69785.123: [Full GC 69785.123: [CMS69789.244: [CMS-concurrent-mark: 4.132/4.136 
secs] [Times: user=16.49 sys=0.03, real=4.13 secs] 
 (concurrent mode failure): 6014591K->6014591K(6014592K), 26.0883660 secs] 
6289343K->6282549K(6289344K), [CMS Perm : 17290K->17287K(28988K)], 26.0886060 
secs] [Times: user=38.28 sys=0.15, real=26.09 secs] 

Anybody got an idea?


Re: RE: live nodes list in ring

2010-06-30 Thread

problem solved with a full cluster stop and then start~

Thanks Stephen.
But still wondering the reason


2010-07-01 







发件人: Stephen Hamer 
发送时间: 2010-07-01  08:39:13 
收件人: user@cassandra.apache.org 
抄送: 'huzhonghua'; 'GongJianTao(宫建涛)' 
主题: RE: live nodes list in ring 
 
When this happens to me I have to do a full cluster restart. Even doing a 
rolling restart across the cluster doesn’t seem to fix them, all of the nodes 
need to be stopped at the same time. After bringing everything back up the ring 
is correct.
 
Does anyone know how a cluster gets into this state?
 
Stephen
 
From: aaron morton [mailto:aa...@thelastpickle.com] 
Sent: Wednesday, June 30, 2010 1:42 PM
To: user@cassandra.apache.org
Cc: 'huzhonghua'; 'GongJianTao(宫建涛)'
Subject: Re: live nodes list in ring
 
At start up do you see log lines like this 
 
Gossiper.java (line 576) Node /192.168.34.30 is now part of the cluster
 
Are all the nodes listed? 
 
aaron
On 30 Jun 2010, at 22:50, 王一锋 wrote:



Hi,
 
In a cassandra cluster, when issueing ring command on every nodes, some can 
show all nodes in the cluster but some can only show some other nodes.
All nodes share the same seed list.
And even some of the nodes in the seed list have this problem.
Restarting the problematic nodes won't solve it.
Try closing firewalls with following commands
 
service iptables stop
 
Still won't work.
 
Anyone got a clue?
 
Thanks very much.
 
Yifeng
 


Re: Re: live nodes list in ring

2010-06-30 Thread
To me, this happens all in a sudden. No nodes removed nor new nodes joined

2010-07-01 







发件人: Benjamin Black 
发送时间: 2010-07-01  09:55:59 
收件人: user 
抄送: 
主题: Re: live nodes list in ring 
 
Does this happen after you have changed the ring topology, especially
adding nodes?
2010/6/30 Stephen Hamer :
> When this happens to me I have to do a full cluster restart. Even doing a
> rolling restart across the cluster doesn't seem to fix them, all of the
> nodes need to be stopped at the same time. After bringing everything back up
> the ring is correct.
>
>
>
> Does anyone know how a cluster gets into this state?
>
>
>
> Stephen
>
>
>
> From: aaron morton [mailto:aa...@thelastpickle.com]
> Sent: Wednesday, June 30, 2010 1:42 PM
> To: user@cassandra.apache.org
> Cc: 'huzhonghua'; 'GongJianTao(宫建涛)'
> Subject: Re: live nodes list in ring
>
>
>
> At start up do you see log lines like this
>
>
>
> Gossiper.java (line 576) Node /192.168.34.30 is now part of the cluster
>
>
>
> Are all the nodes listed?
>
>
>
> aaron
>
> On 30 Jun 2010, at 22:50, 王一锋 wrote:
>
> Hi,
>
>
>
> In a cassandra cluster, when issueing ring command on every nodes, some can
> show all nodes in the cluster but some can only show some other nodes.
>
> All nodes share the same seed list.
>
> And even some of the nodes in the seed list have this problem.
>
> Restarting the problematic nodes won't solve it.
>
> Try closing firewalls with following commands
>
>
>
> service iptables stop
>
>
>
> Still won't work.
>
>
>
> Anyone got a clue?
>
>
>
> Thanks very much.
>
>
>
> Yifeng
>
>


Re: Re: live nodes list in ring

2010-06-30 Thread
Yes, but not all nodes.
Just the part that the restarting node can show on its ring.

2010-07-01 







发件人: aaron morton 
发送时间: 2010-07-01  04:42:56 
收件人: user 
抄送: 'huzhonghua'; 'GongJianTao(宫建涛 )' 
主题: Re: live nodes list in ring 
 
At start up do you see log lines like this 


Gossiper.java (line 576) Node /192.168.34.30 is now part of the cluster


Are all the nodes listed? 


aaron
On 30 Jun 2010, at 22:50, 王一锋 wrote:


Hi,

In a cassandra cluster, when issueing ring command on every nodes, some can 
show all nodes in the cluster but some can only show some other nodes.
All nodes share the same seed list.
And even some of the nodes in the seed list have this problem.
Restarting the problematic nodes won't solve it.
Try closing firewalls with following commands

service iptables stop

Still won't work.

Anyone got a clue?

Thanks very much.

Yifeng


live nodes list in ring

2010-06-30 Thread
Hi, 

In a cassandra cluster, when issueing ring command on every nodes, some can 
show all nodes in the cluster but some can only show some other nodes.
All nodes share the same seed list.
And even some of the nodes in the seed list have this problem.
Restarting the problematic nodes won't solve it.
Try closing firewalls with following commands 

service iptables stop 

Still won't work.

Anyone got a clue?

Thanks very much.

Yifeng


Benefits of using framed transport over non-framed transport?

2010-05-07 Thread
Hi everyone, 

Can anyone throw a light at the benefits of using framed transport over 
non-framed transport? 

We are trying to sum up some performance tuning approaches of cassandra in our 
project.
Can framed transport be counted?

Thanks

2010-05-08