[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360188#comment-14360188 ] Benedict commented on CASSANDRA-8833: - At the risk of sounding like a broken record (in case my earlier statement was missed): bloom filter resizing would need to break this assumption also, and is sort of intrinsically linked to the discussion around summary resizing. Whether or not we want this is another matter I'll leave aside for the moment. Just to outline the remainder of my strategy for mitigating any and all of these risks: * In CASSANDRA-8568, I will: ** introduce a "stable" set of sstables that never changes, and all non-hot-path accesses will use these so they don't risk confusion, and don't have to worry about first/last issues ** ensure compaction strategies are only informed of changes to this "stable" set of readers ** remove the "shadowed" state of an sstable ** make the modification of tracker state transactional and more declarative, so both easier to follow and much harder to let get into a bad state * CASSANDRA-8893, CASSANDRA-7066 and some related work will: ** eliminate the distinction between early open, temporary, and final files on disk, so eliminate at least one layer of the cleanup logic (i.e. make its requirements equivalent to summary/bf resizing) ** which also permits us to simplify the early open logic, by special casing it much less In conjunction with the major overhaul of resource cleanup, AFAICT this mitigates most of the problems: * resource counting is now much easier to reason about, and soon will be even easier. it is also safer to get it wrong. * only paths we know are safe to use overlapping sstables will do so (and in parallel we also enforce the non-overlapping rule) * compaction doesn't have to even be aware it is happening * what I think has been the biggest problem, the actual safe application of state changes (which were never atomic and could actually screw themselves up willfully through assertions) will be transactional and ensure exceptions do not interrupt their execution. it will also encapsulate its own safe rollback, so if we screw up somewhere, it will fix it for us. I don't pretend it'll be 100% first time, but I think this new state will be safer by a significant margin than the pre-early-open state, which we are still seeing bug reports for in the 2.0 line, and has been the cause of many serious bugs (and at least one major public downtime of a well known deployment). I very much hope all of these changes will restore confidence in not only the early open feature, but resource management in general, and hopefully reduce the burden on all maintainers. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360030#comment-14360030 ] Marcus Eriksson commented on CASSANDRA-8833: [~thobbs] do we have any statistics on how much we gain in real life by the redistribution? > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353097#comment-14353097 ] Tyler Hobbs commented on CASSANDRA-8833: bq. see if we can solve the index summary redistribution some other way (only do it on compaction?) to be able to make SSTableReader totally immutable again, got any insight here Tyler Hobbs? We could switch to doing it on compaction only. I suppose we would determine the new summary interval based on the read rates for SSTables participating in the compaction. The main downside is that it would be less responsive to changes. For example, if a new SSTable is flushed that shadows most of the entries in an old SSTable, the old SSTable may keep its existing (and possibly expensive) summary interval until the next compaction. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353003#comment-14353003 ] Benedict commented on CASSANDRA-8833: - bq. I think it would be OK to run it straight up current 2.1 vs current 2.1 with sstable_preemptive_open_interval_in_mb: -1 as well, just to get a feeling for it This isn't a like-for-like comparison, as without preemptive open we need to evict the page cache for compaction results to avoid inactive data polluting it. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352999#comment-14352999 ] Marcus Eriksson commented on CASSANDRA-8833: [~enigmacurry] sure, that works (though, I think it would be OK to run it straight up current 2.1 vs current 2.1 with {{sstable_preemptive_open_interval_in_mb: -1}} as well, just to get a feeling for it) > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352991#comment-14352991 ] Benedict commented on CASSANDRA-8833: - bq. run the exact same test that works too; should give a reasonable idea of ballpark effect on current state > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352980#comment-14352980 ] Ryan McGuire commented on CASSANDRA-8833: - [~benedict] I think what [~krummas] was saying was run the exact same test (on the exact same versions) as CASSANDRA-6916, since that original test was done on spinning disk. We would only test a newer C* after those results came in for SSDs. Is that the test you have in mind [~krummas]? > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352815#comment-14352815 ] Benedict commented on CASSANDRA-8833: - bq. run benchmarks again, this time on ssds (Ryan McGuire could your team help out with that?) This is a little tricky, since we've disabled the eviction of page cache entries for the new files in 2.1, so we'll need to roll a patch to reintroduce it. It's also worth caveating in advance that, since testing SSDs we likely care about latency more than throughput, Gil Tene's arguments (my final thoughts about which can be found [here|https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CCEQFjAA&url=https%3A%2F%2Fgroups.google.com%2Fd%2Fmsg%2Fmechanical-sympathy%2FicNZJejUHfE%2FLMQdQZJwvxEJ&ei=2X39VLi5B4XbUdKUgMAO&usg=AFQjCNF4_q-i6496RrmjZiXHZzAEmOy48A&bvm=bv.87611401,d.d24]) about stress in its current state do hold water, and we we will significantly underestimate any negative effect on latency, although I may be able to calculate a rough adjustment after the fact. This isn't a problem for provisioning tests, since you should never saturate throughput for such a test, but for testing these kinds of effects we absolutely should. Fixing stress to properly simulate all of the messages at a target throughput rate is on my hit list, but won't be done for a few weeks at least. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352719#comment-14352719 ] Marcus Eriksson commented on CASSANDRA-8833: Coming back to this, what I would like is: # run benchmarks again, this time on ssds ([~enigmacurry] could your team help out with that?) # see if we can solve the index summary redistribution some other way (only do it on compaction?) to be able to make SSTableReader totally immutable again, got any insight here [~thobbs]? > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329257#comment-14329257 ] Benedict commented on CASSANDRA-8833: - bq. pre-opening basically makes them mutable so in hindsight it's pretty clearly a major issue. Pre-opening did _not_ make them mutable. Index summary redistribution did this, and I followed suit in the way it dealt with this. We _will_ have to eliminate index summary redistribution as well if we want to restore immutability. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329253#comment-14329253 ] Benedict commented on CASSANDRA-8833: - I'm not sure I follow. Is your comment intended for this ticket? > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329221#comment-14329221 ] T Jake Luciani commented on CASSANDRA-8833: --- I think the reason pre-opening results has exposed so many issues is we had always assumed in the code that sstables are immutable. pre-opening basically makes them mutable so in hindsight it's pretty clearly a major issue. I'm on the fence as to what we should do here. Perhaps we can solve this performance problem differently once CASSANDRA-5863 is in. If we control the page cache we can expunge the old sstables and warm the new one more intelligently than with the FADVISE tricks > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329192#comment-14329192 ] Matt Stump commented on CASSANDRA-8833: --- I'm not against refactoring of SSTableReader, I think it's overly complicated, but I also think that the fix being proposed is overly complicated. You can't get away from OS specific code because the code to manage the file system cache is OS specific. The trySkipCache logic in CLibrary is currently specific to Linux. I've slightly broadened it to Linux and some BSD variants (excluding OSX) in my patch for CASSANDRA-8160. We can get the desired behavior for Windows through selective use of the SEC_NOCACHE flag when opening the memory mapped file for windows, but that would require pushing the cache advisory logic one lower deeper. I don't think switching to an NIO channel will alleviate us from the need to provide hints to the OS of which IO is or is not appropriate to cache. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328236#comment-14328236 ] Benedict commented on CASSANDRA-8833: - bq. Ah - so the 9% w/populate_io_cache_on_flush was the pathological case and the crazy cliff drop-off was the normal use-case then? A bit of clarification there goes a long way. Sorry, I didn't respond to this, so to avoid any lack of clarity: yep, spot on. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328025#comment-14328025 ] Benedict commented on CASSANDRA-8833: - bq. there may be unintended side-effects to some of those changes I have no doubt, which is partially why I consider this premature. We had a period of problems (not isolated to this, I want to reiterate), and we've had a very recent attempt to put them to bed. But not let that settle down to see how successful it has been. I'm expecting a few more minor hiccups. If we see some more major ones, especially regressions, that's a different matter. bq. Thus far we've avoided platform-specific code-paths as much as possible, but that seems a simple enough solution that it would be worth looking into. I'd actually be totally cool with (and maybe even prefer) making this a universal behaviour. That way we don't have multiple mappings of the same file, and we still retain most of the upside, with no behavioural split. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328019#comment-14328019 ] Joshua McKenzie commented on CASSANDRA-8833: bq. Make your mind up Touché. :) I have numbers on another ticket and Branimir's been seeing similar effects on the read path on Windows, so this is more a symptom of me being lazy on bringing them into this discussion (combined with a lack of rigor on producing those #'s thus far). bq. the two aren't mutually exclusive Absolutely true, but the headache associated with renaming or deleting files on Windows is compounded with hard-links and memory-mapping. Delaying the mapping until finalization of the sstable would be a simple solution however that's adding more complexity on top of an already complex situation. Thus far we've avoided platform-specific code-paths as much as possible, but that seems a simple enough solution that it would be worth looking into. Regarding the flurry of recent fixes - as you're well aware, little in this code-base is as simple as it may appear at first glance. I suspect we're going to have other things we need to tidy up with these recent commits and there may be unintended side-effects to some of those changes. A lot of hand-waving here, certainly, but past experience indicates that changes that touch that many places in the code-base almost always have some surprises in store for us. bq. I'm not certain what you're referring to here Ah - so the 9% w/populate_io_cache_on_flush was the pathological case and the crazy cliff drop-off was the normal use-case then? A bit of clarification there goes a long way. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327970#comment-14327970 ] Benedict commented on CASSANDRA-8833: - bq. I suspect that memory mapping will eclipse performance gains bq. I'd prefer hard #'s from test beds. Make your mind up ;) Joking aside, the two aren't mutually exclusive. We only rule out early reopening with mmap because of hard links, but they only live for the duration of the early reopening. We could delay mmapping until the final opening. The two are also very different performance profile implications; early opening mitigates worst case behaviour far more importantly than improving overall system performance (although it does that too). bq. Code-changes to stabilize this have been increasing quite a bit recently They almost all happened within the past month, when I finished up my other commitments and decided to try and put a lid on this and the other correlated problems. So I'm not sure that's indicative of much. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327961#comment-14327961 ] Joshua McKenzie commented on CASSANDRA-8833: Regarding Windows: early re-open with memory-mapped index files isn't a technical feasibility due to ntfs restrictions and I suspect that memory mapping will eclipse performance gains from early re-open for most use-cases. CASSANDRA-8709 will make it an option w/out mmap and we can leave that as the default path. I'm familiar with the cost leading up to this point in time to stabilize this feature and the other systems it stressed. Code-changes to stabilize this have been increasing quite a bit recently so I don't see evidence that we're near the end of stabilization for this feature yet, and as you stated: bq. every line touched is a new bug in waiting Also, the performance gains from early re-open look to vary greatly depending on test setup; the graphs referenced in CASSANDRA-6916 vary from a night-and-day comparison down to a 9% total ops improvement on the regression case. When speaking of the performance improvement of this feature in the comments above I see several phrases that stick out to me: "can be", "Possibly introducing", "may see", "could be". I'd prefer hard #'s from test beds. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327746#comment-14327746 ] Marcus Eriksson commented on CASSANDRA-8833: I agree that we might almost be there and it would suck to 'give up' at this point, but the reason I created the ticket now is that I fear the issues around this will continue, and that it will make it harder to implement new features around compaction (or anything involving sstables really) in the future. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327699#comment-14327699 ] Benedict commented on CASSANDRA-8833: - The TL;DR of my view is: We've only just really (or nearly) finished the work for this, and many of its problems have been more general than just it. So aborting now is premature. However it is broader in scope than originally envisaged, and it has been painful - especially for Marcus (and also for myself) - so I understand the desire to see the back of it, and won't fight for its retention. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327688#comment-14327688 ] Benedict commented on CASSANDRA-8833: - We still have to deal with the complexity of the fact there are multiple instances of a given reader, which is I think where almost all of the problems stem from, and is certainly the most complex part to reason about and get right. I don't think the existence of other kinds of instance is significantly more complex, but I may be wrong about that. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327685#comment-14327685 ] Marcus Eriksson commented on CASSANDRA-8833: bq. SSTableReader lifecycles are still approximately as complex due to index summary redistribution. Don't see how that can be as complex as we don't open early, move starts or shadow any files when doing this > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327678#comment-14327678 ] Benedict commented on CASSANDRA-8833: - Well, it certainly has turned out to be more complicated than expected. In particular "shadowed" is something I would have liked to avoid. The mistake I made when implementing this, apart from not realising LCS required no overlaps, was to not expend a great deal of effort making it _more_ robust than the existing solution. Many of the problems encountered here have exhibited in the prior framework, or in other new pieces that are unrelated, just less frequently. Cleanup of compaction has been buggy since time immemorial; bloom filters, index summaries and entire sstables have been left allocated indefinitely until very recent fixes. Index summary redistribution has had all of the same DataTracker problems. So an argument could be made that this is actually helping us out, by exercising these codepaths more aggressively. After a major overhaul, there was an oversight (CASSANDRA-8764) which is probably to be expected, and is the impetus for this ticket. This overhaul has fixed numerous problems prior as well as new. Removing this will introduce its own risks (every line touched is a new bug in waiting), and doesn't eliminate as much complexity as might be desired. SSTableReader lifecycles are still approximately as complex due to index summary redistribution. Bloom filter resizing (CASSANDRA-6633) will have the same requirement. The same goes for shared resources. Reference counting has been broken for a long time, and we are now in a safer position than we have ever been (although improvements are still needed, such as CASSANDRA-8568), and the latest bugs we are finding in 2.0 are known not to affect 2.1. We wouldn't be there without this as an impetus. So, many of the changes will not be rolled back, because they were necessary in themselves. They were just less pressing. There would still be a significant removal of code, though. As to the improvement: the compaction "cliff" can be pretty significant if you're compacting huge files. Possibly introducing several minutes (or longer) of seriously degraded performance. If you compact tens of gigabytes you may see your entire page cache wiped out. Even with SSDs that's going to be painful for a period. Without SSDs it could be tens of minutes. Other than that, I'll let everyone else reach a conclusion about this. > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8833) Stop opening compaction results early
[ https://issues.apache.org/jira/browse/CASSANDRA-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327612#comment-14327612 ] Jonathan Ellis commented on CASSANDRA-8833: --- /cc [~pmcfadin] [~mstump] [~rssvihla] > Stop opening compaction results early > - > > Key: CASSANDRA-8833 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8833 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 2.1.4 > > > We should simplify the code base by not doing early opening of compaction > results. It makes it very hard to reason about sstable life cycles since they > can be in many different states, "opened early", "starts moved", "shadowed", > "final", instead of as before, basically just one (tmp files are not really > 'live' yet so I don't count those). The ref counting of shared resources > between sstables in these different states is also hard to reason about. This > has caused quite a few issues since we released 2.1 > I think it all boils down to a performance vs code complexity issue, is > opening compaction results early really 'worth it' wrt the performance gain? > The results in CASSANDRA-6916 sure look like the benefits are big enough, but > the difference should not be as big for people on SSDs (which most people who > care about latencies are) > WDYT [~benedict] [~jbellis] [~iamaleksey] [~JoshuaMcKenzie]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)