On Jan 10, 2013, at 8:01 PM, aaron morton <aa...@thelastpickle.com> wrote:

>>  So, one column represents a file in that directory and it has no value.
> Just so I understand, the file contents are *not* stored in the column value ?

That's right, the file contents are not stored in the column value. On that 
particular CF the columns are SuperColumns with 5 sub-columns (size, is_dir, 
hash, name, revision). Each super column is small; I didn't mention super 
columns before because they didn't seem related to the problem at hand.
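
To make the layout concrete, a single "file" entry looks roughly like this 
(a pycassa sketch; the CF name, row key and values here are made up):

import uuid
import pycassa

pool = pycassa.ConnectionPool('Disco', ['localhost:9160'])
dirs = pycassa.ColumnFamily(pool, 'UserDirectories')  # hypothetical CF name

# One file = one super column keyed by a TimeUUID, holding five small
# sub-columns of metadata. No file contents are stored anywhere.
dirs.insert('directory-row-key', {
    uuid.uuid1(): {
        'size': '1234',
        'is_dir': '0',
        'hash': '5e3a1c...',
        'name': 'report.txt',
        'revision': '16',
    },
})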

>> Basically the heap fills up and if several queries happen simultaneously, 
>> the heap is exhausted and the node stops.
> Are you seeing the GCInspector log messages ? Are they ParNew or CMS 
> compactions?
> If you want to get more insight into what the JVM is doing enable the GC 
> logging options in cassandra-env.sh. 

I see a lot of messages from SliceQueryFilter; note that every super column it 
collects carries a "-delete at" tombstone marker, so these reads are scanning 
over deleted data:

DEBUG [ReadStage:53] 2013-01-08 18:08:36,451 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(edbc633e-3f09-11e2-8f7d-e0db55018fa4 -delete 
at 1357508861622915- 
[hash:false:36@1354732265022159,is_dir:false:1@1354732265022159,mtime:false:4@1354732265022159,name:false:57@1354732265022159,revision:false:16@1354732265022159,])
DEBUG [ReadStage:62] 2013-01-08 18:08:36,467 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(75869f16-3f0d-11e2-a935-e0db550199f4 -delete 
at 1357543298946499- 
[hash:false:36@1354733781339045,is_dir:false:1@1354733781339045,mtime:false:4@1354733781339045,name:false:56@1354733781339045,revision:false:16@1354733781339045,])
DEBUG [ReadStage:64] 2013-01-08 18:08:36,449 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(6b3323de-3f0a-11e2-93b7-e0db55018fa4 -delete 
at 1357543981711099- 
[hash:false:36@1354732475524213,is_dir:false:1@1354732475524213,mtime:false:4@1354732475524213,name:false:56@1354732475524213,revision:false:16@1354732475524213,])
DEBUG [ReadStage:51] 2013-01-08 18:08:36,448 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(2e2ccb66-3f0f-11e2-9f34-e0db5501ca40 -delete 
at 1357548656930340- 
[hash:false:36@1354734520625161,is_dir:false:1@1354734520625161,mtime:false:4@1354734520625161,name:false:54@1354734520625161,revision:false:16@1354734520625161,])
DEBUG [ReadStage:62] 2013-01-08 18:08:36,468 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(758c5f3c-3f0d-11e2-a935-e0db550199f4 -delete 
at 1357543303722497- 
[hash:false:36@1354733781376479,is_dir:false:1@1354733781376479,mtime:false:4@1354733781376479,name:false:56@1354733781376479,revision:false:16@1354733781376479,])
DEBUG [ReadStage:61] 2013-01-08 18:08:36,447 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(be15520e-3f08-11e2-843b-e0db550199f4 -delete 
at 1357508704355097- 
[hash:false:36@1354731755577230,is_dir:false:1@1354731755577230,mtime:false:4@1354731755577230,name:false:57@1354731755577230,revision:false:16@1354731755577230,])
DEBUG [ReadStage:52] 2013-01-08 18:08:36,446 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(463b877e-3f0a-11e2-b990-e0db55018fa4 -delete 
at 1357543038078223- 
[hash:false:36@1354732413504338,is_dir:false:1@1354732413504338,mtime:false:4@1354732413504338,name:false:57@1354732413504338,revision:false:16@1354732413504338,])
DEBUG [ReadStage:52] 2013-01-08 18:08:36,471 SliceQueryFilter.java (line 124) 
collecting 1 of 102: SuperColumn(463ef5c6-3f0a-11e2-b990-e0db55018fa4 -delete 
at 1357543038078223- 
[hash:false:36@1354732413523782,is_dir:false:1@1354732413523782,mtime:false:4@1354732413523782,name:false:57@1354732413523782,revision:false:16@1354732413523782,])


GC-related messages:

 INFO [ScheduledTasks:1] 2013-01-09 12:11:17,554 GCInspector.java (line 122) GC 
for ParNew: 426 ms for 2 collections, 6138212856 used; max is 8357150720
 INFO [ScheduledTasks:1] 2013-01-09 12:11:19,819 GCInspector.java (line 122) GC 
for ConcurrentMarkSweep: 324 ms for 1 collections, 6136066400 used; max is 
8357150720
 WARN [ScheduledTasks:1] 2013-01-09 12:11:19,820 GCInspector.java (line 145) 
Heap is 0.7342294767181129 full.  You may need to reduce memtable and/or cache 
sizes.  Cassandra will now flush up to the two largest memtables to free up 
memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you 
don't want Cassandra to do this automatically
 WARN [ScheduledTasks:1] 2013-01-09 12:11:19,821 StorageService.java (line 
2855) Flushing CFS(Keyspace='Disco', ColumnFamily='FilesPerBlock') to relieve 
memory pressure
 INFO [ScheduledTasks:1] 2013-01-09 12:11:19,821 ColumnFamilyStore.java (line 
659) Enqueuing flush of Memtable-FilesPerBlock@271892815(3190888/38297827 
serialized/live bytes, 24184 ops)
 INFO [FlushWriter:5] 2013-01-09 12:11:19,822 Memtable.java (line 264) Writing 
Memtable-FilesPerBlock@271892815(3190888/38297827 serialized/live bytes, 24184 
ops)
 INFO [FlushWriter:5] 2013-01-09 12:11:20,118 Memtable.java (line 305) 
Completed flushing 
/servers/storage/cassandra-data/Disco/FilesPerBlock/Disco-FilesPerBlock-he-3793-Data.db
 (3414379 bytes) for commitlog position ReplayPosition(segmentId=1357730243521, 
position=32441486)
 INFO [ScheduledTasks:1] 2013-01-09 12:11:38,413 GCInspector.java (line 122) GC 
for ParNew: 536 ms for 2 collections, 7337859152 used; max is 8357150720
 INFO [ScheduledTasks:1] 2013-01-09 12:11:39,414 GCInspector.java (line 122) GC 
for ParNew: 253 ms for 1 collections, 7746234936 used; max is 8357150720

…

 INFO [ScheduledTasks:1] 2013-01-09 12:12:16,688 GCInspector.java (line 122) GC 
for ConcurrentMarkSweep: 1966 ms for 2 collections, 2690214576 used; max is 
8357150720
 INFO [ScheduledTasks:1] 2013-01-09 12:12:25,035 GCInspector.java (line 122) GC 
for ParNew: 404 ms for 2 collections, 3385333248 used; max is 8357150720
 INFO [ScheduledTasks:1] 2013-01-09 12:12:26,037 GCInspector.java (line 122) GC 
for ParNew: 211 ms for 1 collections, 3606106264 used; max is 8357150720
 INFO [ScheduledTasks:1] 2013-01-09 12:13:12,024 GCInspector.java (line 122) GC 
for ConcurrentMarkSweep: 1043 ms for 2 collections, 4079166304 used; max is 
8357150720
 INFO [ScheduledTasks:1] 2013-01-09 12:13:17,025 MessagingService.java (line 
658) 1 READ messages dropped in last 5000ms

After just a few minutes:

 WARN [ScheduledTasks:1] 2013-01-09 12:14:34,134 GCInspector.java (line 145) 
Heap is 0.9913292333203247 full.  You may need to reduce memtable and/or cache 
sizes.  Cassandra will now flush up to the two largest memtables to free up 
memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you 
don't want Cassandra to do this automatically
 WARN [ScheduledTasks:1] 2013-01-09 12:14:34,170 StorageService.java (line 
2855) Flushing CFS(Keyspace='Disco', ColumnFamily='NamespaceRevision') to 
relieve memory pressure
 INFO [ScheduledTasks:1] 2013-01-09 12:14:34,241 ColumnFamilyStore.java (line 
659) Enqueuing flush of Memtable-NamespaceRevision@1129512169(3597280/17167436 
serialized/live bytes, 11918 ops)
 INFO [FlushWriter:6] 2013-01-09 12:14:34,243 Memtable.java (line 264) Writing 
Memtable-NamespaceRevision@1129512169(3597280/17167436 serialized/live bytes, 
11918 ops)

> 
>> Dumping the SSTables shows that there were a lot of tombstones between those 
>> 2 columns.
> How many is a lot?

Millions. I have dumped the SSTables to JSON, but I have yet to figure out a 
way to parse them and get more detail, such as an exact count, because the 
files are so big.
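
For what it's worth, an exact count may not need real JSON parsing; streaming 
the dump line by line and counting a marker string might do. A rough sketch 
(the marker is an assumption; check what a deletion actually looks like in 
your dump, e.g. the "d" flag on deleted columns or a "deletedAt" field on 
deleted super columns):

#!/usr/bin/env python
# Stream an sstable2json dump that is too big to load into memory and
# count occurrences of an assumed deletion marker.
import sys

MARKER = '"deletedAt"'  # assumed marker; verify against the actual dump

def count_tombstones(path):
    total = 0
    with open(path) as f:
        for line in f:  # one line at a time, never the whole file
            total += line.count(MARKER)
    return total

if __name__ == '__main__':
    print(count_tombstones(sys.argv[1]))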


>>  Normally I run with an 8GB heap and have no problems, but problematic 
>> queries can fill up the heap even if I bump it up to 24GB. The machines have 
>> 32GB.
> For queries like this it's (usually) not the overall size of the JVM heap, 
> Xmx.
> It's the size of the NEW_HEAP (in cassandra-env.sh) which sets Xmn. And the 
> other new heap settings, SurvivorRatio and MaxTenuringThreshold. What 
> settings do you have for those ?

Well, as a matter of fact I did try increasing the heap to 16GB; when I 
dumped it, these were the statistics:

Heap
 par new generation   total 737280K, used 656554K [0x00000003fae00000, 
0x000000042ce00000, 0x000000042ce00000)
  eden space 655360K, 100% used [0x00000003fae00000, 0x0000000422e00000, 
0x0000000422e00000)
  from space 81920K,   1% used [0x0000000427e00000, 0x0000000427f2aa98, 
0x000000042ce00000)
  to   space 81920K,   0% used [0x0000000422e00000, 0x0000000422e00000, 
0x0000000427e00000)
 concurrent mark-sweep generation total 15958016K, used 15958015K 
[0x000000042ce00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 43068K, used 25708K [0x00000007fae00000, 
0x00000007fd80f000, 0x0000000800000000)


These are the relevant JVM settings (note that -Xmn800M with SurvivorRatio=8 
matches the dump above: a 640MB eden plus two 80MB survivor spaces):

-XX:+UseThreadPriorities 
-XX:ThreadPriorityPolicy=42 
-Xms8049M 
-Xmx8049M 
-Xmn800M 
-XX:+HeapDumpOnOutOfMemoryError 
-Xss196k 
-XX:+UseParNewGC 
-XX:+UseConcMarkSweepGC 
-XX:+CMSParallelRemarkEnabled 
-XX:SurvivorRatio=8 
-XX:MaxTenuringThreshold=1 
-XX:CMSInitiatingOccupancyFraction=75 
-XX:+UseCMSInitiatingOccupancyOnly 
-Djava.net.preferIPv4Stack=true 

> 
>> Of course, the problem goes away after gc_grace_seconds pass and I run a 
>> manual compact on that CF, the tombstones are removed and queries to that 
>> row are efficient again.
> If you have a CF that has a high number of overwrites or deletions, using 
> Levelled Compaction can help. It does use up some more IO than size-tiered, 
> but it's designed for these sorts of situations. See 
> http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra and 
> http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

I'm assuming you mean that tombstones are purged quicker after gc_grace_seconds 
using Levelled Compaction, right? The problem remains before gc_grace_seconds 
passes. I'll take a look at that again; the first time I looked, I was worried 
that too many files would be created per CF. 10MB per sstable seems too small; 
what real-world values are used for that setting on large (20GB+) CFs?
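
If I do try it, I assume the switch itself would look something like this 
pycassa sketch (the 128MB figure is only an illustration of a 
larger-than-default size, not a recommendation, and whether 
alter_column_family accepts these CfDef attributes should be verified against 
the client and server versions):

from pycassa.system_manager import SystemManager

# Switch the CF to Leveled Compaction with a larger sstable target size.
sys_mgr = SystemManager('localhost:9160')
sys_mgr.alter_column_family(
    'Disco', 'FilesPerBlock',
    compaction_strategy='LeveledCompactionStrategy',
    compaction_strategy_options={'sstable_size_in_mb': '128'},
)
sys_mgr.close()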

> Schema wise, you could try having multiple "directory" rows for each user. At 
> certain times you can create a new row, which then receives all the writes. 
> But you read (and delete if necessary) from all rows. Then migrate the data 
> from the old rows to the new one and remove the old row.

I still have not delved into different schemas; this one seems simple enough 
that Cassandra should handle it. For example, your suggestion would imply some 
sort of synchronisation whenever an operation decided that a new row should be 
written, otherwise I could lose some updates. I guess I could use ZooKeeper 
for that, but it seems like over-engineering.
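
For concreteness, this is roughly how I read the suggestion (a pycassa sketch; 
the CF and key names are hypothetical, and deriving the bucket 
deterministically from the clock is my assumption about how writers might 
agree on the current row without explicit coordination):

import time
import pycassa

pool = pycassa.ConnectionPool('Disco', ['localhost:9160'])
dirs = pycassa.ColumnFamily(pool, 'DirectoryBuckets')  # hypothetical CF

def bucket_key(user, ts=None):
    # e.g. "andre:2013-01", one row per user per month; every writer
    # derives the same current row from the clock
    return '%s:%s' % (user, time.strftime('%Y-%m', time.gmtime(ts)))

def add_file(user, file_id, metadata):
    dirs.insert(bucket_key(user), {file_id: metadata})  # current bucket only

def list_files(user, months):
    # reads (and deletes) have to cover every bucket that may hold data
    merged = {}
    for m in months:
        try:
            merged.update(dirs.get('%s:%s' % (user, m), column_count=10000))
        except pycassa.NotFoundException:
            pass
    return merged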

Thanks,
André
