[ https://issues.apache.org/jira/browse/CASSANDRA-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663298#comment-15663298 ]

Benjamin Roth edited comment on CASSANDRA-12730 at 11/14/16 7:38 PM:
---------------------------------------------------------------------

Hi guys!

First of all: Thanks for all the help and information so far.

Back from Web Summit with more time and better internet :) I had a closer look
at the topic. Yesterday I was able to watch the situation while the issue was
recurring, so I could capture some logs that explain it.
The graph: https://cl.ly/2d2B2B1V3T3q
The debug log + dirlist of an affected table: https://cl.ly/063W3A0Y2b2X

I guess my first assumption that the SSTables are created by a flush is not
quite right. According to the logs, it seems the SSTables are created by
streams (StreamReceiveTask, cfs.addSSTables(readers)). Of course this is the
most efficient approach if the SSTables have an "appropriate" size, but when it
results in thousands of super-tiny SSTables, the opposite is the case. I had to
stop the repair again to give the nodes some time to catch up. Otherwise at
least one node would have crashed over night, which most probably would have
ruined the consistency of the whole cluster (again).
I even think that CASSANDRA-12580 could have aggravated the situation, as
higher-resolution merkle trees probably result in more but smaller streams.

From my point of view this situation is not acceptable. It is too dangerous to
run a production system with such behaviour. My proposal for this particular
situation is:
Why not let these super-tiny streams go through the regular write path instead
of creating an SSTable? I guess the check would be very simple. We could either
use a hard limit (for example 1 MB) or a fraction of the total memtable space.
This is one thing.
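
To make the idea concrete, here is a minimal sketch of what such a check could
look like. The types and names (IncomingStream, applyViaMutations, addAsSSTable)
are hypothetical and do not map to Cassandra's actual StreamReceiveTask code;
they only illustrate the decision "tiny stream -> regular write path, everything
else -> keep the current addSSTables path".

{code:java}
// Hypothetical sketch of the proposed size check in the stream receive path.
// The types below are illustrative only and not Cassandra internals.
public final class SmallStreamHandler
{
    /** Hypothetical view of a fully received stream for one table. */
    public interface IncomingStream
    {
        long totalBytes();        // size of the received data
        void applyViaMutations(); // replay rows through the normal write path (memtable)
        void addAsSSTable();      // current behaviour: register the data as a new SSTable
    }

    // Variant 1: hard limit, e.g. 1 MB.
    private static final long HARD_LIMIT_BYTES = 1L << 20;
    // Variant 2: a fraction of the configured memtable space.
    private static final double MEMTABLE_FRACTION = 0.01;

    private final long memtableSpaceBytes;

    public SmallStreamHandler(long memtableSpaceBytes)
    {
        this.memtableSpaceBytes = memtableSpaceBytes;
    }

    public void receive(IncomingStream stream)
    {
        // Both variants combined here for illustration; either one alone would do.
        long threshold = Math.max(HARD_LIMIT_BYTES,
                                  (long) (memtableSpaceBytes * MEMTABLE_FRACTION));
        if (stream.totalBytes() < threshold)
            stream.applyViaMutations(); // tiny stream: don't create a micro SSTable
        else
            stream.addAsSSTable();      // big enough: keep the efficient direct path
    }
}
{code}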

Something different but somehow related:
When I was browsing through the code and trying to get a bigger picture of the
situation, I noticed some things that made me wonder:
1. Memtables are flushed before EVERY stream (StreamSession.addTransferRanges);
this applies both to requesting and sending streams
2. A validation compaction forces a flush
(CompactionManager.doValidationCompaction)

So my conclusion was:
1. A flush on validation is only required if the memtable contains data for the
requested validation ranges (Validator.desc.ranges)
2. Only streams for rebuild + bootstrap require a flush, repair streams do not,
because:
- Bootstrap + rebuild want ALL the data the remote node has to offer, so do
flush
- A repair should only require the data that has been declared "out of sync" by
the validation. All of the validated data has already been flushed on every node
that takes part in the repair, so there should be no need to flush again, right?
A sketch of both decisions follows below.
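
To illustrate the two points, here is a minimal sketch of the flush decisions I
have in mind. The type and method names are made up for the example and are not
Cassandra's real API; token-range wraparound is ignored to keep it short.

{code:java}
import java.util.List;

// Hypothetical sketch of the two flush decisions described above.
public final class FlushPolicy
{
    /** Why a stream transfer was requested. */
    public enum StreamReason { BOOTSTRAP, REBUILD, REPAIR }

    /** Simplified half-open range (left, right]; wraparound is ignored. */
    public static final class Range
    {
        final long left, right;
        Range(long left, long right) { this.left = left; this.right = right; }
        boolean intersects(Range other) { return left < other.right && other.left < right; }
    }

    /** 1. Flush before validation only if the memtable holds data inside the validated ranges. */
    public static boolean flushBeforeValidation(List<Range> memtableRanges, List<Range> validatedRanges)
    {
        for (Range owned : memtableRanges)
            for (Range validated : validatedRanges)
                if (owned.intersects(validated))
                    return true;
        return false;
    }

    /** 2. Flush before streaming only for bootstrap/rebuild; repair streams ship data that was already flushed for validation. */
    public static boolean flushBeforeStream(StreamReason reason)
    {
        return reason == StreamReason.BOOTSTRAP || reason == StreamReason.REBUILD;
    }
}
{code}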

These changes are quite easy to implement and I have already prepared some code,
except: a StreamRequest does not carry a flag to indicate whether a flush is
required, nor the request type (repair, bootstrap, ...). Changing that would
break the streaming protocol and I don't know how to deal with that, so this
part seems to be a bit more complicated, or maybe not worth the effort.
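
For completeness, the kind of change I would expect to be needed is a
version-gated flag in the StreamRequest serialization, roughly like the generic
sketch below. The version constant and field name are made up; this is not
Cassandra's actual StreamRequest serializer, just an illustration of keeping
old and new nodes compatible.

{code:java}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Generic sketch of version-gating a new boolean flag in a serializer so that
// nodes on older versions can still talk to newer ones.
public final class StreamRequestFlagSketch
{
    // Hypothetical messaging version that first understands the new flag.
    static final int VERSION_WITH_FLUSH_FLAG = 11;

    public static void serialize(boolean requiresFlush, DataOutputStream out, int version) throws IOException
    {
        // ... existing StreamRequest fields would be written here ...
        if (version >= VERSION_WITH_FLUSH_FLAG)
            out.writeBoolean(requiresFlush); // only peers on the new version read this byte
    }

    public static boolean deserialize(DataInputStream in, int version) throws IOException
    {
        // ... existing StreamRequest fields would be read here ...
        if (version >= VERSION_WITH_FLUSH_FLAG)
            return in.readBoolean();
        return true; // older peers: fall back to the current behaviour and always flush
    }
}
{code}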

To sum it up:
I'd very much appreciate some feedback on
1. Optimizing how tiny streams are handled in StreamReceiveTask
2. Avoiding unnecessary flushes
3. If 2. applies, whether and how to deal with the protocol changes
4. How I can contribute to this properly: how to submit code, when and how to
write unit tests ... all the boring stuff ;)


> Thousands of empty SSTables created during repair - TMOF death
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-12730
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12730
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Priority: Critical
>
> Last night I ran a repair on a keyspace with 7 tables and 4 MVs, each
> containing a few hundred million records. After a few hours a node died
> because of "too many open files".
> Normally one would just raise the limit, but: we already set this to 100k.
> The problem was that the repair created roughly over 100k SSTables for a
> certain MV. The strange thing is that these SSTables had almost no data (like
> 53 bytes, 90 bytes, ...). Some of them (<5%) had a few 100 KB, very few (<1%)
> had normal sizes like >= a few MB. I could understand that SSTables queue up
> as they are flushed and not compacted in time, but then they should have at
> least a few MB (depending on config and available mem), right?
> Of course the node then runs out of FDs, and I guess it is not a good idea to
> raise the limit even higher as I expect that this would just create even more
> empty SSTables before dying eventually.
> Only 1 CF (MV) was affected. All other CFs (also MVs) behave sanely. The empty
> SSTables were created evenly over time, 100-150 every minute. Among the empty
> SSTables there are also tables that look normal, having a few MB.
> I didn't see any errors or exceptions in the logs until TMOF occurred, just
> tons of streams due to the repair (which I actually run via cs-reaper as
> subrange, full repairs).
> After having restarted that node (and with no more repair running), the number
> of SSTables went down again as they are slowly compacted away.
> According to [~zznate] this issue may relate to CASSANDRA-10342 +
> CASSANDRA-8641


