[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 16kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662070#comment-16662070 ]

Benjamin Roth commented on CASSANDRA-13241:
-------------------------------------------

Agreed

> Lower default chunk_length_in_kb from 64kb to 16kb
> --------------------------------------------------
>
>         Key: CASSANDRA-13241
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
>     Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>    Reporter: Benjamin Roth
>    Assignee: Ariel Weisberg
>    Priority: Major
> Attachments: CompactIntegerSequence.java, CompactIntegerSequenceBench.java, CompactSummingIntegerSequence.java
>
>
> Too low a chunk size may result in some wasted disk space. Too high a chunk size may lead to massive overreads and can have a critical impact on overall system performance.
> In my case, the default chunk size led to peak read IO of up to 1GB/s and average reads of 200MB/s. After lowering the chunk size (of course aligned with read-ahead), the average read IO went below 20MB/s, rather 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but if the model consists rather of small rows or small result sets, the read overhead with a 64kb chunk size is insanely high. This applies, for example, to (small) skinny rows.
> Please also see here: https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights into what a difference it can make (460GB data, 128GB RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows that the request distribution remained the same, so no "dynamic snitch magic": https://cl.ly/3E0t1T1z2c0J

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
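For readers following along: the setting under discussion is the per-table compression chunk size. A hedged sketch of how it is changed (the keyspace/table name `ks.tbl` is a placeholder; LZ4Compressor is the usual default compressor in Cassandra 3.x):

```sql
-- Illustrative example, not from the ticket: lower the compression chunk
-- size of an existing table to 16kb (Cassandra 3.x CQL syntax).
ALTER TABLE ks.tbl
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};
```

Note that the new chunk size only applies to sstables written afterwards; existing sstables keep their old chunk size until rewritten (e.g. via compaction or `nodetool upgradesstables -a`).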
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 16kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661948#comment-16661948 ]

Benjamin Roth commented on CASSANDRA-13241:
-------------------------------------------

4-8kb will not "destroy" the OS page cache. Linux pages are 4kb by default, so 4kb chunks fit perfectly into cache pages.

Actually, read-ahead will kill your performance if you have a lot of disk reads going on. It can kill your page cache if your dataset is a lot larger than available memory and you are doing many random reads with small result sets. We use 4kb chunks and we observed a TREMENDOUS difference in IO reads when disabling read-ahead completely. With default read-ahead kernel settings, the physical read IO was roughly 20-30x higher in our use case, specifically ~600MB/s vs ~20MB/s.

To sum up: the 4kb chunk size alone is not the problem; all components have to be tuned and aligned to remove bottlenecks and make the whole system performant. The specific parameters always depend on the particular case.
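The overread arithmetic discussed in this thread can be sketched in a few lines (illustrative Python, not Cassandra code; the 200-byte row size and the read-ahead figure are assumptions for the sake of the example):

```python
# Estimate bytes physically read per logical row read: a point read must
# fetch at least one whole compression chunk from disk, plus whatever the
# OS read-ahead pulls in on top of it.

def read_amplification(row_bytes, chunk_kb, read_ahead_kb=0):
    """Physical bytes read divided by logical bytes requested."""
    physical = (chunk_kb + read_ahead_kb) * 1024
    return physical / row_bytes

# A hypothetical 200-byte skinny row:
amp_64kb = read_amplification(200, 64)   # 64kb chunks -> ~328x amplification
amp_4kb = read_amplification(200, 4)     # 4kb chunks  -> ~20x amplification
print(amp_64kb, amp_4kb)
```

This is why the comment stresses aligning read-ahead with the chunk size: a 128kb default read-ahead dwarfs even a small chunk, so the two have to be tuned together.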
[jira] [Commented] (CASSANDRA-13798) Disallow filtering on non-primary-key base column for MV
[ https://issues.apache.org/jira/browse/CASSANDRA-13798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141663#comment-16141663 ]

Benjamin Roth commented on CASSANDRA-13798:
-------------------------------------------

Even if the current implementation has known issues, you cannot kill that (or any other) feature just like that. As [~JoshuaMcKenzie] mentioned, how do you treat existing installations + schemas? If I was affected (I really have to check this), this would either force me to change my schema or block me on updates. Neither is viable if the current solution works for my needs. For example, I am not really affected if I have an insert-only payload or if my data does not expire.

What you of course can do: spit out a warning in the logs on bootstrap if the schema is affected, and on affected schema changes, and refer to a JIRA. So one can decide to stay with it or to migrate the schema / model so it is not affected any more.

My 2 cents.

> Disallow filtering on non-primary-key base column for MV
> --------------------------------------------------------
>
>         Key: CASSANDRA-13798
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-13798
>     Project: Cassandra
>  Issue Type: Bug
>  Components: Materialized Views
>    Reporter: ZhaoYang
>    Assignee: ZhaoYang
>
> We should probably consider disallowing filtering conditions on non-primary-key base columns for Materialized Views, which were introduced in CASSANDRA-10368.
> The main problem is that the liveness of a view row now depends on multiple base columns (multiple filtered non-PK base columns + the base column used in the view PK), and this semantic could not be properly supported without drastic storage format changes. (See CASSANDRA-11500, [background|https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16119823&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16119823])
> We should step back and re-consider the non-primary-key filtering feature together with supporting multiple non-PK cols in MV clustering keys in CASSANDRA-10226.
[jira] [Commented] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams
[ https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134749#comment-16134749 ]

Benjamin Roth commented on CASSANDRA-13299:
-------------------------------------------

Sorry for the late response, I was on vacation. No, I am not working on that ticket. But thanks a lot for your efforts (not only) on that ticket!

> Potential OOMs and lock contention in write path streams
> --------------------------------------------------------
>
>         Key: CASSANDRA-13299
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
>     Project: Cassandra
>  Issue Type: Improvement
>    Reporter: Benjamin Roth
>    Assignee: ZhaoYang
>
> I see a potential OOM when a stream (e.g. repair) goes through the write path, as it does with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce row iterators, and they again produce mutations. So every partition creates a single mutation, which in the case of (very) big partitions can result in (very) big mutations. Those are created on heap and stay there until they have finished processing.
> I don't think it is necessary to create a single mutation for each partition. Why don't we implement a PartitionUpdateGeneratorIterator that takes an UnfilteredRowIterator and a max size and spits out PartitionUpdates to be used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size could be like 16M or something.
> A mutation shouldn't be too large, as it also affects MV partition locking. The longer an MV partition is locked during a stream, the higher the chances are that WTEs occur during streams.
> I could also imagine that a max number of updates per mutation, regardless of size in bytes, could make sense to avoid lock contention.
> Love to get feedback and suggestions, incl. naming suggestions.
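The bounded-size splitting proposed in this ticket can be sketched as follows (plain Python with hypothetical names; the real change would live in Cassandra's Java streaming code, and real sizing would use serialized mutation sizes, not string lengths):

```python
# Sketch of a "PartitionUpdateGeneratorIterator"-style splitter: instead of
# materializing one giant mutation per partition, yield batches of rows
# whose combined size stays under a cap, so heap usage and MV partition
# lock hold times stay bounded.

def bounded_updates(rows, max_bytes, size_of=len):
    """Yield lists of rows, each list totalling at most max_bytes by size_of()."""
    batch, batch_size = [], 0
    for row in rows:
        row_size = size_of(row)
        # start a new batch if adding this row would exceed the cap
        if batch and batch_size + row_size > max_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(row)
        batch_size += row_size
    if batch:
        yield batch

rows = ["a" * 10] * 10  # ten 10-byte "rows" of one large partition
batches = list(bounded_updates(rows, max_bytes=25))
# each batch holds at most 2 rows (25 // 10), so the partition splits into 5
```

An oversize single row is still emitted on its own (the cap only limits how rows are grouped), mirroring the fact that a single huge row cannot be split further at this level.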
[jira] [Commented] (CASSANDRA-13066) Fast streaming with materialized views
[ https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079906#comment-16079906 ]

Benjamin Roth commented on CASSANDRA-13066:
-------------------------------------------

No. Go ahead!

> Fast streaming with materialized views
> --------------------------------------
>
>         Key: CASSANDRA-13066
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-13066
>     Project: Cassandra
>  Issue Type: Improvement
>  Components: Materialized Views, Streaming and Messaging
>    Reporter: Benjamin Roth
>    Assignee: Benjamin Roth
>     Fix For: 4.0
>
> I propose adding a configuration option to send streams of tables with MVs not through the regular write path.
> This may be either a global option, or better, a CF option.
> Background:
> A repair of a CF with an MV that is much out of sync creates many streams. These streams all go through the regular write path to assert local consistency of the MV. This again causes a read-before-write for every single mutation, which puts a lot of pressure on the node - much more than simply streaming the SSTable down.
> In some cases this can be avoided. Instead of only repairing the base table, all base + MV tables would have to be repaired. But this can break eventual consistency between the base table and the MV. The proposed behaviour is always safe when having append-only MVs. It also works when using CL_QUORUM writes, but it cannot be absolutely guaranteed that a quorum write is applied atomically, so this can also lead to inconsistencies if a quorum write is started but one node dies in the middle of a request.
> So, this proposal can help a lot in some situations but can also break consistency in others. That's why it should be left up to the operator whether that behaviour is appropriate for individual use cases.
> This issue came up here: https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599
[jira] [Commented] (CASSANDRA-13464) Failed to create Materialized view with a specific token range
[ https://issues.apache.org/jira/browse/CASSANDRA-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062535#comment-16062535 ]

Benjamin Roth commented on CASSANDRA-13464:
-------------------------------------------

I personally don't see a use case. The token range in the MV does not relate to a predictable or contiguous range of data in the base table, so I don't know why I would want to do something like that. If you want to partition your MVs into more tables, then you should rather think of a different partition key, from my point of view.

> Failed to create Materialized view with a specific token range
> --------------------------------------------------------------
>
>         Key: CASSANDRA-13464
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-13464
>     Project: Cassandra
>  Issue Type: Improvement
>    Reporter: Natsumi Kojima
>    Assignee: Krishna Dattu Koneru
>    Priority: Minor
>      Labels: materializedviews
>
> Failed to create Materialized view with a specific token range.
> Example :
> {code:java}
> $ ccm create "MaterializedView" -v 3.0.13
> $ ccm populate -n 3
> $ ccm start
> $ ccm status
> Cluster: 'MaterializedView'
> ---------------------------
> node1: UP
> node3: UP
> node2: UP
> $ ccm node1 cqlsh
> Connected to MaterializedView at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.0.13 | CQL spec 3.4.0 | Native protocol v4]
> Use HELP for help.
> cqlsh> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
> cqlsh> CREATE TABLE test.test ( id text PRIMARY KEY , value1 text , value2 text, value3 text);
> $ ccm node1 ring test
> Datacenter: datacenter1
> ==========
> Address    Rack   Status  State   Load      Owns     Token
>                                                      3074457345618258602
> 127.0.0.1  rack1  Up      Normal  64.86 KB  100.00%  -9223372036854775808
> 127.0.0.2  rack1  Up      Normal  86.49 KB  100.00%  -3074457345618258603
> 127.0.0.3  rack1  Up      Normal  89.04 KB  100.00%  3074457345618258602
> $ ccm node1 cqlsh
> cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('aaa', 'aaa', 'aaa' ,'aaa');
> cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('bbb', 'bbb', 'bbb' ,'bbb');
> cqlsh> SELECT token(id),id,value1 FROM test.test;
>      system.token(id) | id  | value1
> ----------------------+-----+--------
>  -4737872923231490581 | aaa |    aaa
>  -3071845237020185195 | bbb |    bbb
> (2 rows)
> cqlsh> CREATE MATERIALIZED VIEW test.test_view AS SELECT value1, id FROM test.test WHERE id IS NOT NULL AND value1 IS NOT NULL AND TOKEN(id) > -9223372036854775808 AND TOKEN(id) < -3074457345618258603 PRIMARY KEY(value1, id) WITH CLUSTERING ORDER BY (id ASC);
> ServerError: java.lang.ClassCastException: org.apache.cassandra.cql3.TokenRelation cannot be cast to org.apache.cassandra.cql3.SingleColumnRelation
> {code}
> Stacktrace :
> {code:java}
> INFO  [MigrationStage:1] 2017-04-19 18:32:48,131 ColumnFamilyStore.java:389 - Initializing test.test
> WARN  [SharedPool-Worker-1] 2017-04-19 18:44:07,263 FBUtilities.java:337 - Trigger directory doesn't exist, please create it and try again.
> ERROR [SharedPool-Worker-1] 2017-04-19 18:46:10,072 QueryMessage.java:128 - Unexpected error during query
> java.lang.ClassCastException: org.apache.cassandra.cql3.TokenRelation cannot be cast to org.apache.cassandra.cql3.SingleColumnRelation
>         at org.apache.cassandra.db.view.View.relationsToWhereClause(View.java:275) ~[apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.cql3.statements.CreateViewStatement.announceMigration(CreateViewStatement.java:219) ~[apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:93) ~[apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:206) ~[apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:237) ~[apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:222) ~[apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.0.13.jar:3.0.13]
>         at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.0.13.jar:3.0.13]
>         at
[jira] [Comment Edited] (CASSANDRA-13127) Materialized Views: View row expires too soon
[ https://issues.apache.org/jira/browse/CASSANDRA-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002850#comment-16002850 ]

Benjamin Roth edited comment on CASSANDRA-13127 at 5/9/17 3:18 PM:
-------------------------------------------------------------------

[~zznate] I have never "stumbled upon" it, but I was also never taking care of that. We also only use the default TTLs, so maybe this is a different thing. Sounds like it is worth investigating.

I think there should be a consensus on what the expected behaviour should be (especially on partial updates), then some tests should be written, and then the desired behaviour should be implemented if it is not met yet.

Unfortunately I don't have the time at the moment to dig deep into that issue and go through all the details in the code to see what's going on here. Just from reading the description of the issue, it totally looks like a bug - at least from a user's point of view.

If nobody else is available for testing and debugging, maybe I can take a deeper look in 1-2 weeks.

was (Author: brstgt):
[~zznate] I have never "stumbled upon" it but i was also never taking care of that. We also only use the default TTLs, so maybe this is a different thing. Sounds like it is worth investigating on it. I think there should be a consensus what the expected behaviour should be, then some tests should be written and then the desired behaviour should be implemented if it is not met, yet. Unfortunately I don't have the time at the moment to dig deep into that issue and go through all the details in the code to see what's going on here. Just from reading the description of the issue it totally looks like a bug - at least from a user's point of view.
> Materialized Views: View row expires too soon
> ---------------------------------------------
>
>         Key: CASSANDRA-13127
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-13127
>     Project: Cassandra
>  Issue Type: Bug
>  Components: Local Write-Read Paths, Materialized Views
>    Reporter: Duarte Nunes
>
> Consider the following commands, run against trunk:
> {code}
> echo "DROP MATERIALIZED VIEW ks.mv; DROP TABLE ks.base;" | bin/cqlsh
> echo "CREATE TABLE ks.base (p int, c int, v int, PRIMARY KEY (p, c));" | bin/cqlsh
> echo "CREATE MATERIALIZED VIEW ks.mv AS SELECT p, c FROM base WHERE p IS NOT NULL AND c IS NOT NULL PRIMARY KEY (c, p);" | bin/cqlsh
> echo "INSERT INTO ks.base (p, c) VALUES (0, 0) USING TTL 10;" | bin/cqlsh
> # wait for row liveness to get closer to expiration
> sleep 6;
> echo "UPDATE ks.base USING TTL 8 SET v = 0 WHERE p = 0 and c = 0;" | bin/cqlsh
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+--------
>  0 | 0 |      7
> (1 rows)
>  c | p
> ---+---
>  0 | 0
> (1 rows)
> # wait for row liveness to expire
> sleep 4;
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+--------
>  0 | 0 |      3
> (1 rows)
>  c | p
> ---+---
> (0 rows)
> {code}
> Notice how the view row is removed even though the base row is still live. I would say this is because in ViewUpdateGenerator#computeLivenessInfoForEntry the TTLs are compared instead of the expiration times, but I'm not sure I'm getting that far ahead in the code when updating a column that's not in the view.
[jira] [Commented] (CASSANDRA-13127) Materialized Views: View row expires too soon
[ https://issues.apache.org/jira/browse/CASSANDRA-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002850#comment-16002850 ]

Benjamin Roth commented on CASSANDRA-13127:
-------------------------------------------

[~zznate] I have never "stumbled upon" it, but I was also never taking care of that. We also only use the default TTLs, so maybe this is a different thing. Sounds like it is worth investigating.

I think there should be a consensus on what the expected behaviour should be, then some tests should be written, and then the desired behaviour should be implemented if it is not met yet.

Unfortunately I don't have the time at the moment to dig deep into that issue and go through all the details in the code to see what's going on here. Just from reading the description of the issue, it totally looks like a bug - at least from a user's point of view.
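The reporter's hypothesis (TTLs compared instead of expiration times) can be modeled in a few lines. This is a toy sketch in Python, not the actual ViewUpdateGenerator code; the timestamps and TTLs mirror the repro above:

```python
# Toy model of the suspected bug: when merging two liveness infos, the
# survivor should be the one with the LATER absolute expiration time
# (write time + TTL), not the one with the larger TTL.

def survivor_by_ttl(a, b):
    """Suspected buggy comparison: larger TTL wins."""
    return a if a["ttl"] >= b["ttl"] else b

def survivor_by_expiration(a, b):
    """Correct comparison: later write_time + ttl wins."""
    return a if a["ts"] + a["ttl"] >= b["ts"] + b["ttl"] else b

base = {"ts": 0, "ttl": 10}    # INSERT ... USING TTL 10 at t=0, expires at t=10
update = {"ts": 6, "ttl": 8}   # UPDATE ... USING TTL 8 at t=6, expires at t=14

# The TTL comparison wrongly keeps the info that dies at t=10, matching the
# observed early view-row expiry; the expiration comparison keeps t=14.
assert survivor_by_ttl(base, update) == base
assert survivor_by_expiration(base, update) == update
```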
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001349#comment-16001349 ]

Benjamin Roth commented on CASSANDRA-12888:
-------------------------------------------

I am absolutely aware of that! That's why I also added some tests. All unit tests ran well so far. I also ran a bunch of probably related dtests like the MV test suite. It also looked good.

Nevertheless, I don't want to urge you, take the time you need! I appreciate any feedback!

> Incremental repairs broken for MVs and CDC
> ------------------------------------------
>
>         Key: CASSANDRA-12888
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
>     Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>    Reporter: Stefan Podkowinski
>    Assignee: Benjamin Roth
>    Priority: Critical
>     Fix For: 3.0.x, 3.11.x
>
> SSTables streamed during the repair process will first be written locally and afterwards either simply added to the pool of existing sstables or, in case of existing MVs or active CDC, replayed on a mutation basis.
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we must put all partitions through the same write path as normal mutations. This also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental repairs, as we lose the {{repaired_at}} state in the process. Eventually the streamed rows will end up in the unrepaired set, in contrast to the rows on the sender side moved to the repaired set. The next repair run will stream the same data back again, causing rows to bounce on and on between nodes on each repair.
> See linked dtest on steps to reproduce.
> An example for reproducing this manually using ccm can be found [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000302#comment-16000302 ]

Benjamin Roth commented on CASSANDRA-12888:
-------------------------------------------

[~pauloricardomg] Have you been able to take a look at the patch yet? If not, maybe someone else wants to review it? It's been there for 2 months now.

The patch introduces multiple (active) memtables per CF. This could also help in other situations like:
https://issues.apache.org/jira/browse/CASSANDRA-13290
https://issues.apache.org/jira/browse/CASSANDRA-12991
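The "multiple (active) memtables per CF" idea mentioned above can be illustrated with a rough sketch (hypothetical structure in Python, not the actual patch): keeping a separate memtable per repaired_at status means streamed repaired data never gets mixed into the unrepaired write path.

```python
# Sketch: one active memtable (a plain dict here) per repaired_at value,
# so data streamed during an incremental repair keeps its repaired state
# through the memtable -> sstable lifecycle. repaired_at=0 marks the
# normal unrepaired write path.

from collections import defaultdict

class ColumnFamily:
    def __init__(self):
        self.memtables = defaultdict(dict)

    def apply(self, key, value, repaired_at=0):
        self.memtables[repaired_at][key] = value

    def flush(self):
        """Each memtable flushes to an sstable tagged with its repaired_at."""
        sstables = [(repaired_at, table.copy())
                    for repaired_at, table in self.memtables.items()]
        self.memtables.clear()
        return sstables

cf = ColumnFamily()
cf.apply("k1", "v1")                      # regular write
cf.apply("k2", "v2", repaired_at=12345)   # write replayed from a repair stream
sstables = cf.flush()
# two sstables result: one unrepaired (0), one carrying repaired_at=12345
```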
[jira] [Commented] (CASSANDRA-13065) Skip building views during base table streams on range movements
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960301#comment-15960301 ]

Benjamin Roth commented on CASSANDRA-13065:
-------------------------------------------

[~pauloricardomg] thanks for the review! Would you like to take a look at CASSANDRA-13066 for review? It depends on this patch and I have to touch it anyway. So if you have comments on the concept and namings (and probably you will), I can do that in one run.

> Skip building views during base table streams on range movements
> ----------------------------------------------------------------
>
>         Key: CASSANDRA-13065
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
>     Project: Cassandra
>  Issue Type: Improvement
>    Reporter: Benjamin Roth
>    Assignee: Benjamin Roth
>    Priority: Critical
>     Fix For: 4.0
>
> Booting or decommissioning nodes with MVs is unbearably slow, as all streams go through the regular write paths. This causes read-before-writes for every mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount of time.
> Using the regular streaming behaviour for consistent range movements works much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> The bootstrap case is super easy to handle; the decommission case requires CASSANDRA-13064.
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959033#comment-15959033 ]

Benjamin Roth commented on CASSANDRA-13065:
-------------------------------------------

Looks good. I like putting the requiresViewBuild property into the StreamOperation!
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954589#comment-15954589 ]

Benjamin Roth commented on CASSANDRA-13065:
-------------------------------------------

[~pauloricardomg] There you go. My first complete patch. I hope it's ok and works :)
https://github.com/Jaumo/cassandra/commit/88699700feb6b9a504df88ff063b82641d7939f7
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15945381#comment-15945381 ]

Benjamin Roth commented on CASSANDRA-13065:
-------------------------------------------

[~pauloricardomg] Gnaaa. Stupid search+replace errors. Made changes according to your feedback:
https://github.com/Jaumo/cassandra/commit/7ba773e901bcdb3bf830417ebd07ad4786a5b179
CDC uses the write path again. Hope this time everything's ok. Thanks for your patience :)
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942722#comment-15942722 ] Benjamin Roth commented on CASSANDRA-13065: --- [~pauloricardomg] Did you notice my update? > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth >Priority: Critical > Fix For: 4.0 > > > Booting or decommissioning nodes with MVs is unbearably slow as all streams go > through the regular write paths. This causes read-before-writes for every > mutation and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on own cluster. > Bootstrap case is super easy to handle, decommission case requires > CASSANDRA-13064 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily
[ https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938577#comment-15938577 ] Benjamin Roth commented on CASSANDRA-13226: --- That does not make sense to me. Why should more be streamed than requested? That sounds like a waste of resources to me. Streaming more than a repair requires assumes that the system is still creating inconsistent data during the repair. > StreamPlan for incremental repairs flushing memtables unnecessarily > --- > > Key: CASSANDRA-13226 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13226 > Project: Cassandra > Issue Type: Bug >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Minor > Fix For: 4.0 > > > Since incremental repairs are run against a fixed dataset, there's no need to > flush memtables when streaming for them. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13315) Consistency is confusing for new users
[ https://issues.apache.org/jira/browse/CASSANDRA-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903459#comment-15903459 ] Benjamin Roth commented on CASSANDRA-13315: --- I had the same problems in the beginning, so generally +1. But IMHO this should go along with an explanatory section in the official docs. > Consistency is confusing for new users > -- > > Key: CASSANDRA-13315 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13315 > Project: Cassandra > Issue Type: Improvement >Reporter: Ryan Svihla > > New users really struggle with consistency levels and fall into a large number > of tarpits trying to decide on the right one. > 1. There are a LOT of consistency levels and it's up to the end user to > reason about what combinations are valid and what is really what they intend > it to be. Is there any reason why write at ALL and read at CL TWO is better > than read at CL ONE? > 2. They require a good understanding of failure modes to do well. It's not > uncommon for people to use CL ONE and wonder why their data is missing. > 3. The serial consistency level "bucket" is confusing to even write about and > easy to get wrong even for experienced users. > So I propose the following steps (EDIT based on Jonathan's comment): > 1. Remove the "serial consistency" level of consistency levels and just have > all consistency levels in one bucket to set, conditions still need to be > required for SERIAL/LOCAL_SERIAL > 2. add 3 new consistency levels pointing to existing ones but that infer > intent much more cleanly: >* EVENTUALLY_CONSISTENT = LOCAL_ONE reads and writes >* HIGHLY_CONSISTENT = LOCAL_QUORUM reads and writes >* TRANSACTIONALLY_CONSISTENT = LOCAL_SERIAL reads and writes > for global levels of this I propose keeping the old ones around, they're > rarely used in the field except by accident or by particularly opinionated and > advanced users. 
> Drivers should put the new consistency levels in a new package and docs > should be updated to suggest their use. Likewise setting the default CL should > only provide those three settings and apply it for reads and writes at the > same time. > CQLSH I'm gonna suggest should default to HIGHLY_CONSISTENT. New sysadmins > get surprised by this frequently and I can think of a couple very major > escalations because people were confused what the default behavior was. > The benefit to all this change is we shrink the surface area that one has to > understand when learning Cassandra greatly, and we have far fewer bad initial > experiences and surprises. New users will more likely be able to wrap their > brains around those 3 ideas more readily than they can "what happens when I > have RF2, QUORUM writes and ONE reads". Advanced users still get access to all > the levels, while new users don't have to learn all the ins and outs of > distributed theory just to write data and be able to read it back. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
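The three proposed intent levels map one-to-one onto existing levels, which could be sketched as a thin alias enum. This is a hypothetical illustration of the mapping in the proposal above; the enum names and structure are invented and are not part of any actual Cassandra or driver API:

```java
// Existing levels the aliases would resolve to (subset, for illustration only).
enum BaseConsistencyLevel { LOCAL_ONE, LOCAL_QUORUM, LOCAL_SERIAL }

// Hypothetical intent-named aliases, mapped per the proposal above.
enum IntentConsistencyLevel {
    EVENTUALLY_CONSISTENT(BaseConsistencyLevel.LOCAL_ONE),         // eventual reads and writes
    HIGHLY_CONSISTENT(BaseConsistencyLevel.LOCAL_QUORUM),          // quorum reads and writes
    TRANSACTIONALLY_CONSISTENT(BaseConsistencyLevel.LOCAL_SERIAL); // serial reads and writes

    final BaseConsistencyLevel target;

    IntentConsistencyLevel(BaseConsistencyLevel target) { this.target = target; }
}
```

A driver could accept either enum and translate the intent levels before building requests, so the wire protocol stays unchanged.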
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900951#comment-15900951 ] Benjamin Roth commented on CASSANDRA-12489: --- Awesome. I see that there has been a lot of work done by thelastpickle since I initially forked it from spotify (which didn't support 3.x back then). A simple changelog.md would be even more awesome to see if there have been important changes. > consecutive repairs of same range always finds 'out of sync' in sane cluster > > > Key: CASSANDRA-12489 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12489 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Benjamin Roth >Assignee: Benjamin Roth > Labels: lhf > Attachments: trace_3_10.1.log.gz, trace_3_10.2.log.gz, > trace_3_10.3.log.gz, trace_3_10.4.log.gz, trace_3_9.1.log.gz, > trace_3_9.2.log.gz > > > No matter how often or when I run the same subrange repair, it ALWAYS tells > me that some ranges are out of sync. Tested in 3.9 + 3.10 (git trunk of > 2016-08-17). The cluster is sane. All nodes are up, cluster is not overloaded. > I guess this is not a desired behaviour. I'd expect that a repair does what > it says and a consecutive repair shouldn't report "out of syncs" any more if > the cluster is sane. > Especially for tables with MVs that puts a lot of pressure during repair as > ranges are repaired over and over again. > See traces of different runs attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898051#comment-15898051 ] Benjamin Roth commented on CASSANDRA-12489: --- Thanks for the answer. That's what I thought. But what reason do incremental repairs then have to exist in the real world if (most, many, whatever) people use a tool that makes repairs manageable and eliminates this case? The use case + real benefit is quite limited then, isn't it? Probably that's a philosophical question, but I'm curious what other guys think about it and whether I am maybe missing a valuable use case. > consecutive repairs of same range always finds 'out of sync' in sane cluster > > > Key: CASSANDRA-12489 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12489 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Benjamin Roth >Assignee: Benjamin Roth > Labels: lhf > Attachments: trace_3_10.1.log.gz, trace_3_10.2.log.gz, > trace_3_10.3.log.gz, trace_3_10.4.log.gz, trace_3_9.1.log.gz, > trace_3_9.2.log.gz > > > No matter how often or when I run the same subrange repair, it ALWAYS tells > me that some ranges are out of sync. Tested in 3.9 + 3.10 (git trunk of > 2016-08-17). The cluster is sane. All nodes are up, cluster is not overloaded. > I guess this is not a desired behaviour. I'd expect that a repair does what > it says and a consecutive repair shouldn't report "out of syncs" any more if > the cluster is sane. > Especially for tables with MVs that puts a lot of pressure during repair as > ranges are repaired over and over again. > See traces of different runs attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898028#comment-15898028 ] Benjamin Roth commented on CASSANDRA-12489: --- May I ask what's the reason that incremental + subrange repair doesn't do anticompaction? Is it because anticompaction is too expensive in this case or, to put it in different words: is a subrange full repair cheaper than a subrange incremental repair with anticompaction? > consecutive repairs of same range always finds 'out of sync' in sane cluster > > > Key: CASSANDRA-12489 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12489 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Benjamin Roth >Assignee: Benjamin Roth > Labels: lhf > Attachments: trace_3_10.1.log.gz, trace_3_10.2.log.gz, > trace_3_10.3.log.gz, trace_3_10.4.log.gz, trace_3_9.1.log.gz, > trace_3_9.2.log.gz > > > No matter how often or when I run the same subrange repair, it ALWAYS tells > me that some ranges are out of sync. Tested in 3.9 + 3.10 (git trunk of > 2016-08-17). The cluster is sane. All nodes are up, cluster is not overloaded. > I guess this is not a desired behaviour. I'd expect that a repair does what > it says and a consecutive repair shouldn't report "out of syncs" any more if > the cluster is sane. > Especially for tables with MVs that puts a lot of pressure during repair as > ranges are repaired over and over again. > See traces of different runs attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897969#comment-15897969 ] Benjamin Roth commented on CASSANDRA-13303: --- My trunk is older, so I'll close the ticket. > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > Cause seems to be that sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth resolved CASSANDRA-13303. --- Resolution: Duplicate Duplicate of the CASSANDRA-13038 regression fixes > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > Cause seems to be that sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897966#comment-15897966 ] Benjamin Roth commented on CASSANDRA-13303: --- I read the comments in 13038, I guess we are talking about the same thing. Also MetadataSerializerTest fails on my machine. > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > Cause seems to be that sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897956#comment-15897956 ] Benjamin Roth commented on CASSANDRA-13303: --- CASSANDRA-13038 is fixed in commit a5ce963117acf5e4cf0a31057551f2f42385c398, which I have in my trunk, and I don't see a newer commit for 13038. > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > Cause seems to be that sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897942#comment-15897942 ] Benjamin Roth edited comment on CASSANDRA-13303 at 3/6/17 7:58 PM: --- 1. Happens in trunk 2. Maybe not clear enough: Table is simply not compacted as AbstractCompactionStrategy.worthDroppingTombstones returns false as droppableRatio is 0.0 {code} double droppableRatio = sstable.getEstimatedDroppableTombstoneRatio(gcBefore); {code} Message: "should be less than x but was y" was (Author: brstgt): 1. Happens in trunk 2. Maybe not clear enough: Table is simply not compacted as AbstractCompactionStrategy.worthDroppingTombstones returns false as droppableRatio is 0.0 {code} double droppableRatio = sstable.getEstimatedDroppableTombstoneRatio(gcBefore); {code} > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > Cause seems to be that sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897942#comment-15897942 ] Benjamin Roth commented on CASSANDRA-13303: --- 1. Happens in trunk 2. Maybe not clear enough: Table is simply not compacted as AbstractCompactionStrategy.worthDroppingTombstones returns false as droppableRatio is 0.0 {code} double droppableRatio = sstable.getEstimatedDroppableTombstoneRatio(gcBefore); {code} > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > Cause seems to be that sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
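For context, the check the test trips over boils down to comparing the estimated droppable-tombstone ratio against a threshold; with a ratio of 0.0 the SSTable is never selected. A minimal sketch of that decision, with an illustrative class name (0.2 is Cassandra's documented default `tombstone_threshold`, and the real `worthDroppingTombstones` applies further conditions not shown here):

```java
// Simplified sketch of the tombstone-compaction eligibility check.
// Class and constant names are illustrative, not Cassandra's exact API.
final class TombstoneCheck {
    static final double TOMBSTONE_THRESHOLD = 0.2; // default tombstone_threshold

    static boolean worthDroppingTombstones(double droppableRatio) {
        // With droppableRatio == 0.0, as observed in the flaky run, this is always false,
        // so the SSTable is never selected for a tombstone compaction.
        return droppableRatio > TOMBSTONE_THRESHOLD;
    }
}
```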
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897917#comment-15897917 ] Benjamin Roth commented on CASSANDRA-12888: --- Review would be much appreciated. Don't know if @pauloricardomg still wants to do the review. Please give me some feedback, thanks! > Incremental repairs broken for MVs and CDC > -- > > Key: CASSANDRA-12888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12888 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Stefan Podkowinski >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.0.x, 3.11.x > > > SSTables streamed during the repair process will first be written locally and > afterwards either simply added to the pool of existing sstables or, in case > of existing MVs or active CDC, replayed on a mutation basis: > As described in {{StreamReceiveTask.OnCompletionRunnable}}: > {quote} > We have a special path for views and for CDC. > For views, since the view requires cleaning up any pre-existing state, we > must put all partitions through the same write path as normal mutations. This > also ensures any 2is are also updated. > For CDC-enabled tables, we want to ensure that the mutations are run through > the CommitLog so they can be archived by the CDC process on discard. > {quote} > Using the regular write path turns out to be an issue for incremental > repairs, as we lose the {{repaired_at}} state in the process. Eventually the > streamed rows will end up in the unrepaired set, in contrast to the rows on > the sender side moved to the repaired set. The next repair run will stream > the same data back again, causing rows to bounce on and on between nodes on > each repair. > See linked dtest on steps to reproduce. An example for reproducing this > manually using ccm can be found > [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897917#comment-15897917 ] Benjamin Roth edited comment on CASSANDRA-12888 at 3/6/17 7:48 PM: --- Review would be much appreciated. Don't know if [~pauloricardomg] still wants to do the review. Please give me some feedback, thanks! was (Author: brstgt): Review would be much appreciated. Don't know if @pauloricardomg still wants to do the review. Please give me some feedback, thanks! > Incremental repairs broken for MVs and CDC > -- > > Key: CASSANDRA-12888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12888 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Stefan Podkowinski >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.0.x, 3.11.x > > > SSTables streamed during the repair process will first be written locally and > afterwards either simply added to the pool of existing sstables or, in case > of existing MVs or active CDC, replayed on a mutation basis: > As described in {{StreamReceiveTask.OnCompletionRunnable}}: > {quote} > We have a special path for views and for CDC. > For views, since the view requires cleaning up any pre-existing state, we > must put all partitions through the same write path as normal mutations. This > also ensures any 2is are also updated. > For CDC-enabled tables, we want to ensure that the mutations are run through > the CommitLog so they can be archived by the CDC process on discard. > {quote} > Using the regular write path turns out to be an issue for incremental > repairs, as we lose the {{repaired_at}} state in the process. Eventually the > streamed rows will end up in the unrepaired set, in contrast to the rows on > the sender side moved to the repaired set. The next repair run will stream > the same data back again, causing rows to bounce on and on between nodes on > each repair. > See linked dtest on steps to reproduce. 
An example for reproducing this > manually using ccm can be found > [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-12888: -- Status: Patch Available (was: Awaiting Feedback) https://github.com/apache/cassandra/compare/trunk...Jaumo:CASSANDRA-12888 Some dtest assertions: https://github.com/riptano/cassandra-dtest/compare/master...Jaumo:CASSANDRA-12888?expand=1 > Incremental repairs broken for MVs and CDC > -- > > Key: CASSANDRA-12888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12888 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Stefan Podkowinski >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.0.x, 3.11.x > > > SSTables streamed during the repair process will first be written locally and > afterwards either simply added to the pool of existing sstables or, in case > of existing MVs or active CDC, replayed on a mutation basis: > As described in {{StreamReceiveTask.OnCompletionRunnable}}: > {quote} > We have a special path for views and for CDC. > For views, since the view requires cleaning up any pre-existing state, we > must put all partitions through the same write path as normal mutations. This > also ensures any 2is are also updated. > For CDC-enabled tables, we want to ensure that the mutations are run through > the CommitLog so they can be archived by the CDC process on discard. > {quote} > Using the regular write path turns out to be an issue for incremental > repairs, as we lose the {{repaired_at}} state in the process. Eventually the > streamed rows will end up in the unrepaired set, in contrast to the rows on > the sender side moved to the repaired set. The next repair run will stream > the same data back again, causing rows to bounce on and on between nodes on > each repair. > See linked dtest on steps to reproduce. 
An example for reproducing this > manually using ccm can be found > [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
Benjamin Roth created CASSANDRA-13303: - Summary: CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky Key: CASSANDRA-13303 URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 Project: Cassandra Issue Type: Bug Reporter: Benjamin Roth On my machine, this test succeeds maybe 1 out of 10 times. Cause seems to be that sstable is not elected for compaction in worthDroppingTombstones as droppableRatio is 0.0 I don't know the primary intention of this test, so I didn't touch it but the conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897589#comment-15897589 ] Benjamin Roth commented on CASSANDRA-12888: --- Btw.: My concept seems to work, but there is one question left: Why does a StreamSession create unrepaired SSTables? IncomingFileMessage => Creates RangeAwareSSTableWriter:97 => cfs.createSSTableMultiWriter ... => CompactionStrategyManager.createSSTableMultiWriter:185 Will it be marked as repaired later? If so, where/when? Why I ask: The received SSTable has the repairedFlag in RangeAwareSSTableWriter and its header but it is lost when the SSTable is finished and returned as SSTableReader. > Incremental repairs broken for MVs and CDC > -- > > Key: CASSANDRA-12888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12888 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Stefan Podkowinski >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.0.x, 3.11.x > > > SSTables streamed during the repair process will first be written locally and > afterwards either simply added to the pool of existing sstables or, in case > of existing MVs or active CDC, replayed on a mutation basis: > As described in {{StreamReceiveTask.OnCompletionRunnable}}: > {quote} > We have a special path for views and for CDC. > For views, since the view requires cleaning up any pre-existing state, we > must put all partitions through the same write path as normal mutations. This > also ensures any 2is are also updated. > For CDC-enabled tables, we want to ensure that the mutations are run through > the CommitLog so they can be archived by the CDC process on discard. > {quote} > Using the regular write path turns out to be an issue for incremental > repairs, as we lose the {{repaired_at}} state in the process. Eventually the > streamed rows will end up in the unrepaired set, in contrast to the rows on > the sender side moved to the repaired set. 
The next repair run will stream > the same data back again, causing rows to bounce on and on between nodes on > each repair. > See linked dtest on steps to reproduce. An example for reproducing this > manually using ccm can be found > [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897565#comment-15897565 ] Benjamin Roth commented on CASSANDRA-12888: --- I also had this idea but it won't work. It will totally break base <> MV consistency. Except: You lock all involved partitions for the whole process. But that would create insanely long locks and an extremely high contention > Incremental repairs broken for MVs and CDC > -- > > Key: CASSANDRA-12888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12888 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Stefan Podkowinski >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.0.x, 3.11.x > > > SSTables streamed during the repair process will first be written locally and > afterwards either simply added to the pool of existing sstables or, in case > of existing MVs or active CDC, replayed on a mutation basis: > As described in {{StreamReceiveTask.OnCompletionRunnable}}: > {quote} > We have a special path for views and for CDC. > For views, since the view requires cleaning up any pre-existing state, we > must put all partitions through the same write path as normal mutations. This > also ensures any 2is are also updated. > For CDC-enabled tables, we want to ensure that the mutations are run through > the CommitLog so they can be archived by the CDC process on discard. > {quote} > Using the regular write path turns out to be an issue for incremental > repairs, as we lose the {{repaired_at}} state in the process. Eventually the > streamed rows will end up in the unrepaired set, in contrast to the rows on > the sender side moved to the repaired set. The next repair run will stream > the same data back again, causing rows to bounce on and on between nodes on > each repair. > See linked dtest on steps to reproduce. 
An example for reproducing this > manually using ccm can be found > [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams
[ https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13299: -- Description: I see a potential OOM, when a stream (e.g. repair) goes through the write path as it is with MVs. StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators and they again produce mutations. So every partition creates a single mutation, which in case of (very) big partitions can result in (very) big mutations. Those are created on heap and stay there until they have finished processing. I don't think it is necessary to create a single mutation for each partition. Why don't we implement a PartitionUpdateGeneratorIterator that takes an UnfilteredRowIterator and a max size and spits out PartitionUpdates to be used to create and apply mutations? The max size should be something like min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size could be like 16M or sth. A mutation shouldn't be too large as it also affects MV partition locking. The longer a MV partition is locked during a stream, the higher chances are that WTE's occur during streams. I could also imagine that a max number of updates per mutation regardless of size in bytes could make sense to avoid lock contention. Love to get feedback and suggestions, incl. naming suggestions. was: I see a potential OOM, when a stream (e.g. repair) goes through the write path as it is with MVs. StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators and they again produce mutations. So every partition creates a single mutation, which in case of (very) big partitions can result in (very) big mutations. Those are created on heap and stay there until they are processed. I don't think it is necessary to create a single mutation for each partition. 
Why don't we implement a PartitionUpdateGeneratorIterator that takes an UnfilteredRowIterator and a max size and spits out PartitionUpdates to be used to create and apply mutations? The max size should be something like min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size could be like 16M or sth. A mutation shouldn't be too large as it also affects MV partition locking. The longer a MV partition is locked during a stream, the higher chances are that WTE's occur during streams. I could also imagine that a max number of updates per mutation regardless of size in bytes could make sense to avoid lock contention. Love to get feedback and suggestions, incl. naming suggestions. > Potential OOMs and lock contention in write path streams > > > Key: CASSANDRA-13299 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13299 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth > > I see a potential OOM, when a stream (e.g. repair) goes through the write > path as it is with MVs. > StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators > and they again produce mutations. So every partition creates a single > mutation, which in case of (very) big partitions can result in (very) big > mutations. Those are created on heap and stay there until they have finished > processing. > I don't think it is necessary to create a single mutation for each partition. > Why don't we implement a PartitionUpdateGeneratorIterator that takes an > UnfilteredRowIterator and a max size and spits out PartitionUpdates to be > used to create and apply mutations? > The max size should be something like min(reasonable_absolute_max_size, > max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size > could be like 16M or sth. > A mutation shouldn't be too large as it also affects MV partition locking. 
> The longer a MV partition is locked during a stream, the higher chances are > that WTE's occur during streams. > I could also imagine that a max number of updates per mutation regardless of > size in bytes could make sense to avoid lock contention. > Love to get feedback and suggestions, incl. naming suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
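The batching proposed above — one iterator in, several size-bounded groups out — can be sketched with plain collections. This is a hypothetical illustration, not Cassandra's actual UnfilteredRowIterator/PartitionUpdate API; the names SizeBoundedBatcher, maxBatchBytes and maxMutationBytes are invented for the sketch, and a real implementation would estimate serialized sizes rather than take a callback.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical sketch: split a stream of rows into batches whose estimated
// serialized size stays under a cap, so one huge partition no longer has to
// become one huge on-heap mutation.
public final class SizeBoundedBatcher<T> implements Iterator<List<T>> {
    private final Iterator<T> rows;
    private final ToLongFunction<T> sizeOf; // estimated serialized size per row
    private final long maxBatchBytes;
    private T pending; // row that did not fit into the previous batch

    public SizeBoundedBatcher(Iterator<T> rows, ToLongFunction<T> sizeOf, long maxBatchBytes) {
        this.rows = rows;
        this.sizeOf = sizeOf;
        this.maxBatchBytes = maxBatchBytes;
    }

    @Override
    public boolean hasNext() {
        return pending != null || rows.hasNext();
    }

    @Override
    public List<T> next() {
        List<T> batch = new ArrayList<>();
        long bytes = 0;
        if (pending != null) { // carry over the row that overflowed last time
            batch.add(pending);
            bytes += sizeOf.applyAsLong(pending);
            pending = null;
        }
        while (rows.hasNext()) {
            T row = rows.next();
            long size = sizeOf.applyAsLong(row);
            if (!batch.isEmpty() && bytes + size > maxBatchBytes) {
                pending = row; // this row starts the next batch
                break;
            }
            batch.add(row); // an oversized single row still goes out alone
            bytes += size;
        }
        return batch;
    }

    // Cap derived as proposed in the ticket; all three inputs are placeholders.
    public static long maxMutationBytes(long reasonableAbsoluteMax, long maxMutationSize,
                                        long commitlogSegmentSize) {
        return Math.min(reasonableAbsoluteMax, Math.min(maxMutationSize, commitlogSegmentSize / 2));
    }
}
```

With a cap of 16 MB and a 32 MB commitlog segment, maxMutationBytes yields 16 MB, matching the min(...) formula in the description.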
[jira] [Commented] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams
[ https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896460#comment-15896460 ] Benjamin Roth commented on CASSANDRA-13299: --- Relating to CASSANDRA-11670, this would also allow writing all streamed mutations to the commitlog without problems. I also propose doing so with small streams (see CASSANDRA-13290). Writing small streams (e.g. < 100 KB) to the commitlog does not require a flush at the end of the stream receive. This avoids tons of flushes if tons of tiny streams are sent during a repair session. These are maybe apples and oranges, but fixing all these loose ends makes the whole process less error-prone and will probably perform better.
[jira] [Created] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams
Benjamin Roth created CASSANDRA-13299: - Summary: Potential OOMs and lock contention in write path streams Key: CASSANDRA-13299 URL: https://issues.apache.org/jira/browse/CASSANDRA-13299 Project: Cassandra Issue Type: Improvement Reporter: Benjamin Roth
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896363#comment-15896363 ] Benjamin Roth commented on CASSANDRA-12888: --- Just perfect! That's EXACTLY what I wanted to know, and it helps me continue working on this ticket. I started some proof-of-concept work, but it still needs some finalizing and exhaustive testing. The concept is quite simple in theory (hopefully in reality, too): each table may now have more than one active memtable, one each for unrepaired and repaired data (like compaction pools). The repaired memtable does not have to be resident all the time, only during repairs, so my intention was to create it on demand and not to automatically re-create one after a flush. To keep things simple for a start, my intention was to apply flush behaviour to both memtables: either both are flushed or none. Maybe this can be optimized in the future. > Incremental repairs broken for MVs and CDC > -- > > Key: CASSANDRA-12888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12888 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Reporter: Stefan Podkowinski > Assignee: Benjamin Roth > Priority: Critical > Fix For: 3.0.x, 3.11.x > > > SSTables streamed during the repair process will first be written locally and > afterwards either simply added to the pool of existing sstables or, in case > of existing MVs or active CDC, replayed on a mutation basis: > As described in {{StreamReceiveTask.OnCompletionRunnable}}: > {quote} > We have a special path for views and for CDC. > For views, since the view requires cleaning up any pre-existing state, we > must put all partitions through the same write path as normal mutations. This > also ensures any 2is are also updated. > For CDC-enabled tables, we want to ensure that the mutations are run through > the CommitLog so they can be archived by the CDC process on discard. > {quote} > Using the regular write path turns out to be an issue for incremental > repairs, as we lose the {{repaired_at}} state in the process. Eventually the > streamed rows will end up in the unrepaired set, in contrast to the rows on > the sender side moved to the repaired set. The next repair run will stream > the same data back again, causing rows to bounce on and on between nodes on > each repair. > See linked dtest for steps to reproduce. An example for reproducing this > manually using ccm can be found > [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895140#comment-15895140 ] Benjamin Roth commented on CASSANDRA-12888: --- Maybe my earlier question got lost: what effect does the repairedAt flag have on future repairs, except that a non-zero value means that a table has been repaired at some time? I'd be happy about any code references.
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894645#comment-15894645 ] Benjamin Roth commented on CASSANDRA-12888: --- For a detailed explanation, an excerpt from that discussion:
- ... there are still possible scenarios where it's possible to break consistency by repairing the base and the view separately, even with QUORUM writes:
Initial state:
Base replica A: {k0=v0, ts=0}
Base replica B: {k0=v0, ts=0}
Base replica C: {k0=v0, ts=0}
View paired replica A: {v0=k0, ts=0}
View paired replica B: {v0=k0, ts=0}
View paired replica C: {v0=k0, ts=0}
Base replica A receives write {k1=v1, ts=1}, propagates it to view paired replica A, and dies. Current state is:
Base replica A: {k1=v1, ts=1}
Base replica B: {k0=v0, ts=0}
Base replica C: {k0=v0, ts=0}
View paired replica A: {v1=k1, ts=1}
View paired replica B: {v0=k0, ts=0}
View paired replica C: {v0=k0, ts=0}
Base replicas B and C receive write {k2=v2, ts=2} and write to their paired replicas. The write is successful at QUORUM. Current state is:
Base replica A: {k1=v1, ts=1}
Base replica B: {k2=v2, ts=2}
Base replica C: {k2=v2, ts=2}
View paired replica A: {v1=k1, ts=1}
View paired replica B: {v2=k2, ts=2}
View paired replica C: {v2=k2, ts=2}
A returns from the dead. Repair the base table:
Base replica A: {k2=v2, ts=2}
Base replica B: {k2=v2, ts=2}
Base replica C: {k2=v2, ts=2}
Repair the MV:
View paired replica A: {v1=k1, ts=1} and {v2=k2, ts=2}
View paired replica B: {v1=k1, ts=1} and {v2=k2, ts=2}
View paired replica C: {v1=k1, ts=1} and {v2=k2, ts=2}
So, this requires replica A to generate a tombstone for {v1=k1, ts=1} during repair of the base table.
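The walkthrough above can be replayed with a toy last-write-wins model. This is a hypothetical simulation (plain HashMaps, not Cassandra code): the base row is a single key "k" whose value changes over time, the view indexes value -> key, and "repair" is a naive LWW merge of all replicas. It shows how the stale view entry for v1 survives a repair that only merges view replicas, unless replica A emits a tombstone while its base row is overwritten.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy last-write-wins (LWW) model of the scenario above. All names and
// structures are invented for illustration; this ignores real-world details
// like merkle trees, tombstone semantics, and hinted handoff.
public class MvRepairScenario {
    static final class Cell {
        final String value; final long ts;
        Cell(String value, long ts) { this.value = value; this.ts = ts; }
    }

    // Last-write-wins: keep the cell with the highest timestamp.
    static void lwwPut(Map<String, Cell> table, String key, Cell c) {
        Cell cur = table.get(key);
        if (cur == null || c.ts > cur.ts) table.put(key, c);
    }

    // "Repair": merge all replicas by LWW and push the result back to each.
    static void repair(List<Map<String, Cell>> replicas) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> r : replicas) r.forEach((k, c) -> lwwPut(merged, k, c));
        for (Map<String, Cell> r : replicas) merged.forEach((k, c) -> lwwPut(r, k, c));
    }

    /** Replays the scenario; returns the final base values and view rows of replica A. */
    public static Map<String, Set<String>> runScenario() {
        List<Map<String, Cell>> base = new ArrayList<>();
        List<Map<String, Cell>> view = new ArrayList<>();
        for (int i = 0; i < 3; i++) { // initial state: k=v0 everywhere
            base.add(new HashMap<>());
            view.add(new HashMap<>());
            lwwPut(base.get(i), "k", new Cell("v0", 0));
            lwwPut(view.get(i), "v0", new Cell("k", 0));
        }
        // A (index 0) receives k=v1 at ts=1, updates its paired view, then dies.
        lwwPut(base.get(0), "k", new Cell("v1", 1));
        lwwPut(view.get(0), "v1", new Cell("k", 1));
        view.get(0).remove("v0"); // the MV write path removes the old view row
        // B and C receive k=v2 at ts=2; successful at QUORUM.
        for (int i = 1; i < 3; i++) {
            lwwPut(base.get(i), "k", new Cell("v2", 2));
            lwwPut(view.get(i), "v2", new Cell("k", 2));
            view.get(i).remove("v0");
        }
        repair(base); // base repair: all replicas converge on k=v2
        repair(view); // view repair: merges WITHOUT consulting the base table
        Map<String, Set<String>> result = new HashMap<>();
        Set<String> baseValues = new HashSet<>();
        for (Cell c : base.get(0).values()) baseValues.add(c.value);
        result.put("baseValues", baseValues);
        result.put("viewRows", new HashSet<>(view.get(0).keySet()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> r = runScenario();
        System.out.println("base values after repair: " + r.get("baseValues"));
        System.out.println("view rows after repair:   " + r.get("viewRows"));
    }
}
```

After both repairs, the base holds only v2 while the view still holds rows for both v1 and v2: the v1 row is orphaned, which is exactly why streaming view SSTables "blindly" is unsafe.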
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894639#comment-15894639 ] Benjamin Roth commented on CASSANDRA-12888: --- A repair must go through the write path except in some special cases. I also first had the idea to avoid it completely, but in discussion with [~pauloricardomg] it turned out that this may introduce inconsistencies that could only be fixed by a view rebuild, because it leaves stale rows. I know that all this stuff is totally counter-intuitive, but just streaming "blindly" all SSTables (incl. MV tables) down is not correct. This is why I am trying to improve the mutation-based approach. If the SSTables for MVs get corrupted or lost, the only way to fix them is to rebuild the view again. There is no way (at least none I see atm) that would consistently repair a view from other nodes. The underlying principle is:
- A view must always be consistent with its base table
- A view does not have to be consistent among nodes; that is handled by repairing the base table
That's also why you don't have to run a repair before building a view. It would not help anyway, because you NEVER have a 100% guaranteed consistent state. A repair only guarantees consistency up to the point of repair. The "know what you are doing" option is offered by CASSANDRA-13066, btw. In that ticket I also adapted the selection of CFs (tables + MVs) when doing a keyspace repair, depending on whether the MV is repaired by stream or by mutation.
[jira] [Commented] (CASSANDRA-11670) Rebuilding or streaming MV generates mutations larger than max_mutation_size_in_kb
[ https://issues.apache.org/jira/browse/CASSANDRA-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894128#comment-15894128 ] Benjamin Roth commented on CASSANDRA-11670: --- Hmmm, I guess this possibly breaks consistency in repair streams + MVs. In StreamReceiveTask, mutations are applied without the commitlog because the CF is flushed at the end. But MVs are not flushed. Either: also flush all MVs at the end of the stream task - though it is not certain that this is actually required for all MVs, as we do not know where view replica updates eventually go. Or: enable the commitlog for view replica updates even if the base table does not commitlog its writes. > Rebuilding or streaming MV generates mutations larger than > max_mutation_size_in_kb > -- > > Key: CASSANDRA-11670 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11670 > Project: Cassandra > Issue Type: Bug > Components: Configuration, Streaming and Messaging > Reporter: Anastasia Osintseva > Assignee: Paulo Motta > Fix For: 3.0.10, 3.10 > > > I have a cluster with 2 DCs, 2 nodes in each DC. I wanted to add 1 node to each > DC. One node was added successfully after I had run scrubbing. > Now I'm trying to add a node to the other DC, but get an error: > org.apache.cassandra.streaming.StreamException: Stream failed. > After scrubbing and repair I get the same error. > {noformat} > ERROR [StreamReceiveTask:5] 2016-04-27 00:33:21,082 Keyspace.java:492 - > Unknown exception caught while attempting to update MaterializedView!
> messages_dump.messages > java.lang.IllegalArgumentException: Mutation of 34974901 bytes is too large > for the maxiumum size of 33554432 > at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:264) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:469) > [apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:384) > [apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205) > [apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Mutation.apply(Mutation.java:217) > [apache-cassandra-3.0.5.jar:3.0.5] > at > org.apache.cassandra.batchlog.BatchlogManager.store(BatchlogManager.java:146) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at > org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:724) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at > org.apache.cassandra.db.view.ViewManager.pushViewReplicaUpdates(ViewManager.java:149) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:487) > [apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:384) > [apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205) > [apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Mutation.apply(Mutation.java:217) > [apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Mutation.applyUnsafe(Mutation.java:236) > [apache-cassandra-3.0.5.jar:3.0.5] > at > org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:169) > [apache-cassandra-3.0.5.jar:3.0.5] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [na:1.8.0_11] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [na:1.8.0_11] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [na:1.8.0_11] > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_11] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_11] > ERROR [StreamReceiveTask:5] 2016-04-27 00:33:21,082 > StreamReceiveTask.java:214 - Error applying streamed data: > java.lang.IllegalArgumentException: Mutation of 34974901 bytes is too large > for the maxiumum size of 33554432 > at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:264) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:469) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:384) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at org.apache.cassandra.db.Mutation.apply(Mutation.java:217) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at > org.apache.cassandra.batchlog.BatchlogManager.store(BatchlogManager.java:146) > ~[apache-cassandra-3.0.5.jar:3.0.5] > at >
[jira] [Comment Edited] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818955#comment-15818955 ] Benjamin Roth edited comment on CASSANDRA-12888 at 3/3/17 9:57 AM: --- Hi Victor, We use MVs in production with billions of records without known data loss. "Painful + slow" refers to repairs and range movements (e.g. bootstrap + decommission). Also (as mentioned in this ticket) incremental repairs don't work, so full repair creates some overhead. Until 3.10 there are bugs leading to write timeouts, even to NPEs and completely blocked mutation stages. This could even bring your cluster down. In 3.10 some issues have been resolved - we currently use a patched trunk version which is 1-2 months old. Depending on your model, MVs can help a lot from a developer perspective. Some cases are very resource-intensive to manage without MVs, requiring distributed locks and/or CAS. For append-only workloads it may be simpler NOT to use MVs at the moment: such models aren't very complex, and MVs won't help that much compared to the problems that may arise with them. Painful scenarios: there is no recipe for that. You may or may not encounter performance issues, depending on your model and your workload. I'd recommend not to use MVs whose partition key differs from the base table's, as this requires inter-node communication for EVERY write operation. So you can easily kill your cluster with bulk operations (like in streaming). At the moment our cluster runs stable, but it took months to find all the bottlenecks, race conditions, resume-from-failure issues and so on. So my recommendation: you can get it to work, but you need time, and you should not start with critical data, at least if it is not backed by another stable storage. And you should use 3.10 when it is finally released, or build your own version from trunk. I would not recommend using < 3.10 for MVs. Btw.: our own patched version does some dirty tricks that may lead to inconsistencies in some situations, but we prefer some possible inconsistencies (which we can deal with) over performance bottlenecks. I created several tickets to improve MV performance in some streaming situations, but it will take some time to really improve that situation. Does this answer your question? -- Benjamin Roth Prokurist Jaumo GmbH · www.jaumo.com Wehrstraße 46 · 73035 Göppingen · Germany Phone +49 7161 304880-6 · Fax +49 7161 304880-1 AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
[jira] [Comment Edited] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894006#comment-15894006 ] Benjamin Roth edited comment on CASSANDRA-12888 at 3/3/17 9:55 AM: --- I am about to hack a proof of concept for this issue. Concept: each mutation and each partition update get a "repairedAt" flag. This is passed along through the whole write path, including MV updates and serialization for remote MV updates. Then repaired and non-repaired mutations have to be separated in memtables and flushed to separate SSTables. From what I can see, it should be easier to maintain one memtable each for repaired and non-repaired data than to track the repair state within a memtable. Passing the repair state to replicas isn't even necessary, as replicas should not be repaired directly anyway, so there is no need for a repairedAt state there. My question is: how important is the exact value of "repairedAt"? Is it possible to merge updates with different repair timestamps into a single memtable and finally flush them to an SSTable with repairedAt set to the latest or earliest repairedAt timestamp of all mutations in the memtable? Or would that produce repair inconsistencies or something? Any feedback?
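The routing described above — separate memtables for repaired and unrepaired data, flushed together — can be sketched roughly as follows. This is a hypothetical illustration with plain maps standing in for real memtables; the class and method names are invented, and it deliberately takes the "earliest repairedAt wins" answer to the open question, since under-claiming repair progress is the safe direction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one memtable for unrepaired data and one, created on
// demand during repair streams, for repaired data, so a flush can produce
// SSTables carrying the correct repairedAt metadata.
public class RepairAwareMemtables {
    public static final long UNREPAIRED = 0L; // 0 marks unrepaired data, as in Cassandra

    private final Map<String, String> unrepaired = new ConcurrentHashMap<>();
    private Map<String, String> repaired;     // lazily created, only during repairs
    private long repairedAt = UNREPAIRED;     // earliest repairedAt seen so far

    public synchronized void apply(String key, String value, long mutationRepairedAt) {
        if (mutationRepairedAt == UNREPAIRED) {
            unrepaired.put(key, value);
        } else {
            if (repaired == null) repaired = new ConcurrentHashMap<>();
            repaired.put(key, value);
            // Conservative merge: keep the earliest repairedAt so the flushed
            // SSTable never claims data was repaired later than it was.
            repairedAt = (repairedAt == UNREPAIRED) ? mutationRepairedAt
                                                    : Math.min(repairedAt, mutationRepairedAt);
        }
    }

    /** Flush both memtables together, as proposed; returns [unrepairedCount, repairedCount, repairedAt]. */
    public synchronized long[] flush() {
        long[] stats = { unrepaired.size(), repaired == null ? 0 : repaired.size(), repairedAt };
        unrepaired.clear();
        repaired = null;           // re-created on demand by the next repair stream
        repairedAt = UNREPAIRED;
        return stats;
    }
}
```

The either-both-or-none flush policy from the comment above maps directly onto the single flush() call clearing both memtables.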
[jira] [Updated] (CASSANDRA-13066) Fast streaming with materialized views
[ https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13066: -- Fix Version/s: 4.0 Status: Patch Available (was: Open) Depends on 13064+13065 https://github.com/Jaumo/cassandra/commits/CASSANDRA-13066 > Fast streaming with materialized views > -- > > Key: CASSANDRA-13066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13066 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth > Fix For: 4.0 > > > I propose adding a configuration option to send streams of tables with MVs > not through the regular write path. > This may be either a global option or better a CF option. > Background: > A repair of a CF with an MV that is much out of sync creates many streams. > These streams all go through the regular write path to assert local > consistency of the MV. This causes a read-before-write for every single > mutation, which puts a lot of pressure on the node - much more than > simply streaming the SSTable down. > In some cases this can be avoided. Instead of only repairing the base table, > all base + MV tables would have to be repaired. But this can break eventual > consistency between base table and MV. The proposed behaviour is always safe > when having append-only MVs. It also works when using CL_QUORUM writes, but it > cannot be absolutely guaranteed that a quorum write is applied atomically, > so this can also lead to inconsistencies if a quorum write is started but > one node dies in the middle of a request. > So, this proposal can help a lot in some situations but can also break > consistency in others. That's why it should be left up to the operator whether that > behaviour is appropriate for individual use cases. 
> This issue came up here: > https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (CASSANDRA-13293) MV read-before-write can be omitted for some operations
Benjamin Roth created CASSANDRA-13293: - Summary: MV read-before-write can be omitted for some operations Key: CASSANDRA-13293 URL: https://issues.apache.org/jira/browse/CASSANDRA-13293 Project: Cassandra Issue Type: Improvement Reporter: Benjamin Roth A view that has the same fields in the primary key as its base table (I call it a congruent key) does not require read-before-writes except for: - Range deletes - Partition deletes If the view uses filters on non-PK columns, either an RBW (read-before-write) is required or a write that does not match the filter has to be turned into a delete. When in doubt I'd stay with the current behaviour and do an RBW. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
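The rule proposed above can be condensed into a small decision function. This is a made-up illustration of the ticket's idea, not Cassandra code: the class, enum, and parameter names are all hypothetical, and it deliberately takes the "when in doubt, do an RBW" fallback for filtered views.

```java
// Hypothetical sketch of when a materialized-view update needs a read-before-write (RBW),
// following the rule in the ticket: with a "congruent" view key (same columns as the
// base-table primary key), only range and partition deletes still need an RBW.
public final class MvWritePath
{
    public enum Op { INSERT, UPDATE, ROW_DELETE, RANGE_DELETE, PARTITION_DELETE }

    public static boolean requiresReadBeforeWrite(boolean congruentKey,
                                                  boolean viewFiltersNonPkColumns,
                                                  Op op)
    {
        if (!congruentKey)
            return true;  // base rows can map to different view partitions: must read old state
        if (viewFiltersNonPkColumns)
            return true;  // "when in doubt": keep current behaviour instead of turning
                          // non-matching writes into deletes
        // congruent key, no filter: only multi-row deletes need the existing rows
        return op == Op.RANGE_DELETE || op == Op.PARTITION_DELETE;
    }
}
```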
[jira] [Commented] (CASSANDRA-13290) Optimizing very small repair streams
[ https://issues.apache.org/jira/browse/CASSANDRA-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892981#comment-15892981 ] Benjamin Roth commented on CASSANDRA-13290: --- CASSANDRA-8911 brings a very interesting approach into the game, but the solution is rather complex (as can be seen in the stalled ticket activity). I guess both 12888 and this ticket are lower-hanging fruit for a start, though I'm not saying it's not worth working on both approaches. > Optimizing very small repair streams > > > Key: CASSANDRA-13290 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13290 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth > > I often encountered repair scenarios where a lot of tiny repair streams were > created. This results in hundreds, thousands or even tens of thousands of super > small SSTables (a few bytes to a few kilobytes). > This puts a lot of pressure on compaction and may even lead to a crash due to > too many open files - I also encountered this. > What could help to avoid this: > After CASSANDRA-12888 is resolved, a tiny stream (e.g. < 100kb) could be sent > through the write path to be buffered by memtables instead of creating an > SSTable for each. > Without CASSANDRA-12888 this would break incremental repairs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (CASSANDRA-13290) Optimizing very small repair streams
Benjamin Roth created CASSANDRA-13290: - Summary: Optimizing very small repair streams Key: CASSANDRA-13290 URL: https://issues.apache.org/jira/browse/CASSANDRA-13290 Project: Cassandra Issue Type: Improvement Reporter: Benjamin Roth I often encountered repair scenarios where a lot of tiny repair streams were created. This results in hundreds, thousands or even tens of thousands of super small SSTables (a few bytes to a few kilobytes). This puts a lot of pressure on compaction and may even lead to a crash due to too many open files - I also encountered this. What could help to avoid this: After CASSANDRA-12888 is resolved, a tiny stream (e.g. < 100kb) could be sent through the write path to be buffered by memtables instead of creating an SSTable for each. Without CASSANDRA-12888 this would break incremental repairs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
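The proposed heuristic boils down to a size check on the incoming stream. A minimal sketch, assuming a threshold like the "< 100kb" example from the ticket; the class, method, and constant names are illustrative, not Cassandra API:

```java
// Hypothetical policy: tiny incoming repair streams get buffered via the
// write path (memtables) instead of each materializing a micro-SSTable.
public final class TinyStreamPolicy
{
    // assumption: the ticket's example threshold of 100kb
    private static final long TINY_STREAM_THRESHOLD_BYTES = 100 * 1024;

    public static boolean sendThroughWritePath(long streamedBytes)
    {
        return streamedBytes < TINY_STREAM_THRESHOLD_BYTES;
    }
}
```

As the ticket notes, routing repair streams through the write path is only safe for incremental repair once CASSANDRA-12888 preserves the repaired state along that path.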
[jira] [Resolved] (CASSANDRA-12985) Update MV repair documentation
[ https://issues.apache.org/jira/browse/CASSANDRA-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth resolved CASSANDRA-12985. --- Resolution: Resolved MV repairs won't be changed as proposed > Update MV repair documentation > -- > > Key: CASSANDRA-12985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12985 > Project: Cassandra > Issue Type: Task >Reporter: Benjamin Roth > Fix For: 3.0.x, 3.11.x > > > Due to CASSANDRA-12888 the way MVs are being repaired changes. > Before: > MV has been repaired by repairing the base table. Repairing the MV separately > has been discouraged. Also repairing a whole KS containing a MV has been > discouraged. > After: > MVs are treated like any other table in repairs. They also MUST be repaired > as any other table. Base table does NOT repair MV any more. > Repairing a whole keyspace is encouraged. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892143#comment-15892143 ] Benjamin Roth commented on CASSANDRA-13241: --- So... who's gonna do it? > Lower default chunk_length_in_kb from 64kb to 4kb > - > > Key: CASSANDRA-13241 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13241 > Project: Cassandra > Issue Type: Wish > Components: Core >Reporter: Benjamin Roth > > Having a too low chunk size may result in some wasted disk space. A too high > chunk size may lead to massive overreads and may have a critical impact on > overall system performance. > In my case, the default chunk size led to peak read IOs of up to 1GB/s and > avg reads of 200MB/s. After lowering the chunk size (of course aligned with read > ahead), the avg read IO went below 20 MB/s, rather 10-15MB/s. > The risk of (physical) overreads is increasing with lower (page cache size) / > (total data size) ratio. > High chunk sizes are mostly appropriate for bigger payloads per request but > if the model consists mostly of small rows or small resultsets, the read > overhead with 64kb chunk size is insanely high. This applies for example for > (small) skinny rows. > Please also see here: > https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY > To give you some insight into what a difference it can make (460GB data, 128GB > RAM): > - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L > - Disk throughput: https://cl.ly/2a0Z250S1M3c > - This shows that the request distribution remained the same, so no "dynamic > snitch magic": https://cl.ly/3E0t1T1z2c0J -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892040#comment-15892040 ] Benjamin Roth edited comment on CASSANDRA-13065 at 3/2/17 12:06 PM: [~pauloricardomg] Please also look at the follow-up commits during review. I added 2 more commits (for 13064+13065) with tiny fixes. Depending on your feedback, I can rearrange them if required. https://github.com/Jaumo/cassandra/tree/CASSANDRA-13064 was (Author: brstgt): [~pauloricardomg] Please also look at follow-up commits on review. I added 2 more commits (13064+13065) with tiny fixes. Depending on your feedback, I can rearrange them if required. > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth >Priority: Critical > Fix For: 4.0 > > > Booting or decommissioning nodes with MVs is unbearably slow as all streams go > through the regular write paths. This causes read-before-writes for every > mutation and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on own cluster. > Bootstrap case is super easy to handle, decommission case requires > CASSANDRA-13064 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892040#comment-15892040 ] Benjamin Roth commented on CASSANDRA-13065: --- [~pauloricardomg] Please also look at the follow-up commits during review. I added 2 more commits (13064+13065) with tiny fixes. Depending on your feedback, I can rearrange them if required. > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth >Priority: Critical > Fix For: 4.0 > > > Booting or decommissioning nodes with MVs is unbearably slow as all streams go > through the regular write paths. This causes read-before-writes for every > mutation and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on own cluster. > Bootstrap case is super easy to handle, decommission case requires > CASSANDRA-13064 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13279) Table default settings file
[ https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890143#comment-15890143 ] Benjamin Roth commented on CASSANDRA-13279: --- I think [~slebresne] is basically right: transparency for users (meaning better docs in a central place) and better defaults are more important than convenience. If convenience introduces potential problems, then it's not really worth it. I am also ok with closing it. > Table default settings file > --- > > Key: CASSANDRA-13279 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13279 > Project: Cassandra > Issue Type: Wish > Components: Configuration >Reporter: Romain Hardouin >Priority: Minor > Labels: config, documentation > > Following CASSANDRA-13241 we often see that there is no one-size-fits-all > value for settings. We can't find a sweet spot for every use case. > It's true for settings in cassandra.yaml but as [~brstgt] said for > {{chunk_length_in_kb}}: "this is somewhat hidden for the average user". > Many table settings are somewhat hidden for the average user. Some people > will think RTFM but if a file - say tables.yaml - contains default values for > table settings, more people would pay attention to them. And of course this > file could contain useful comments and guidance. > Example with SSTable compression options: > {code} > # General comments about sstable compression > compression: > # First of all: explain what is it. We split each SSTable into chunks, > etc. > # Explain when users should lower this value (e.g. 4) or when a higher > value like 64 or 128 is recommended. > # Explain the trade-off between read latency and off-heap compression > metadata size. > chunk_length_in_kb: 16 > > # List of available compressors: LZ4Compressor, SnappyCompressor, and > DeflateCompressor > # Explain trade-offs, some specific use cases (e.g. archives), etc. 
> class: 'LZ4Compressor' > > # If you want to disable compression by default, uncomment the following > line > #enabled: false > {code} > So instead of hard coded values we would end up with something like > TableConfig + TableDescriptor à la Config + DatabaseDescriptor. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (CASSANDRA-13066) Fast streaming with materialized views
[ https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth reassigned CASSANDRA-13066: - Assignee: Benjamin Roth > Fast streaming with materialized views > -- > > Key: CASSANDRA-13066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13066 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth > > I propose adding a configuration option to send streams of tables with MVs > not through the regular write path. > This may be either a global option or better a CF option. > Background: > A repair of a CF with an MV that is much out of sync creates many streams. > These streams all go through the regular write path to assert local > consistency of the MV. This again causes a read before write for every single > mutation which again puts a lot of pressure on the node - much more than > simply streaming the SSTable down. > In some cases this can be avoided. Instead of only repairing the base table, > all base + mv tables would have to be repaired. But this can break eventual > consistency between base table and MV. The proposed behaviour is always safe, > when having append-only MVs. It also works when using CL_QUORUM writes but it > cannot be absolutely guaranteed, that a quorum write is applied atomically, > so this can also lead to inconsistencies, if a quorum write is started but > one node dies in the middle of a request. > So, this proposal can help a lot in some situations but also can break > consistency in others. That's why it should be left upon the operator if that > behaviour is appropriate for individual use cases. > This issue came up here: > https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13066) Fast streaming with materialized views
[ https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13066: -- Summary: Fast streaming with materialized views (was: Fast repair with materialized views) > Fast streaming with materialized views > -- > > Key: CASSANDRA-13066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13066 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth > > I propose adding a configuration option to send streams of tables with MVs > not through the regular write path. > This may be either a global option or better a CF option. > Background: > A repair of a CF with an MV that is much out of sync creates many streams. > These streams all go through the regular write path to assert local > consistency of the MV. This again causes a read before write for every single > mutation which again puts a lot of pressure on the node - much more than > simply streaming the SSTable down. > In some cases this can be avoided. Instead of only repairing the base table, > all base + mv tables would have to be repaired. But this can break eventual > consistency between base table and MV. The proposed behaviour is always safe, > when having append-only MVs. It also works when using CL_QUORUM writes but it > cannot be absolutely guaranteed, that a quorum write is applied atomically, > so this can also lead to inconsistencies, if a quorum write is started but > one node dies in the middle of a request. > So, this proposal can help a lot in some situations but also can break > consistency in others. That's why it should be left upon the operator if that > behaviour is appropriate for individual use cases. > This issue came up here: > https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13065: -- Fix Version/s: 4.0 > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth >Priority: Critical > Fix For: 4.0 > > > Booting or decommissioning nodes with MVs is unbearably slow as all streams go > through the regular write paths. This causes read-before-writes for every > mutation and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on own cluster. > Bootstrap case is super easy to handle, decommission case requires > CASSANDRA-13064 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889675#comment-15889675 ] Benjamin Roth commented on CASSANDRA-13065: --- [~pauloricardomg] This is the follow-up to CASSANDRA-13064. I also optimized behaviour for CDC if no write path is required due to MVs. This will allow incremental repairs for CFs with CDC without MVs. > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth >Priority: Critical > Fix For: 4.0 > > > Booting or decommissioning nodes with MVs is unbearably slow as all streams go > through the regular write paths. This causes read-before-writes for every > mutation and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on own cluster. > Bootstrap case is super easy to handle, decommission case requires > CASSANDRA-13064 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13065: -- Status: Patch Available (was: Open) https://github.com/Jaumo/cassandra/commit/95a215e4f9c46e62580dcd4f638c80d3cf9716db > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth >Priority: Critical > > Booting or decommissioning nodes with MVs is unbearably slow as all streams go > through the regular write paths. This causes read-before-writes for every > mutation and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on own cluster. > Bootstrap case is super easy to handle, decommission case requires > CASSANDRA-13064 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream
[ https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889672#comment-15889672 ] Benjamin Roth commented on CASSANDRA-13064: --- [~pauloricardomg] Would you like to take a look at my patch? For a start I only replaced stream descriptions with a discrete enum. It's the easiest refactoring that does not break compatibility with existing serialization. If you want you can also take a look at the next commit which belongs to CASSANDRA-13065 > Add stream type or purpose to stream plan / stream > -- > > Key: CASSANDRA-13064 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13064 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth > Fix For: 4.0 > > > It would be very good to know the type or purpose of a certain stream on the > receiver side. It should be both available in a stream request and a stream > task. > Why? > It would be helpful to distinguish the purpose to allow different handling of > streams and requests. Examples: > - In a stream request a global flush is done. This is not necessary for all > types of streams. A repair stream(-plan) does not require a flush as this has > been done shortly before in validation compaction and only the sstables that > have been validated also have to be streamed. > - In StreamReceiveTask streams for MVs go through the regular write path; this > is painfully slow especially on bootstrap and decommission. Both for bootstrap > and decommission this is not necessary. SSTables can be directly streamed > down in this case. Handling bootstrap is no problem as it relies on a local > state but during decommission, the decom-state is bound to the sender and not > the receiver, so the receiver has to know that it is safe to stream that > sstable directly, not through the write-path. That's why we have to know the > purpose of the stream. 
> I'd love to implement this on my own but I am not sure how not to break the > streaming protocol for backwards compat or if it is ok to do so. > Furthermore I'd love to get some feedback on that idea and some proposals > what stream types to distinguish. I could imagine: > - bootstrap > - decommission > - repair > - replace node > - remove node > - range relocation > Comments like this support my idea; knowing the purpose could avoid this: > {quote} > // TODO each call to transferRanges re-flushes, this is > potentially a lot of waste > streamPlan.transferRanges(newEndpoint, preferred, > keyspaceName, ranges); > {quote} > An alternative to passing the purpose of the stream would be to pass flags like: > - requiresFlush > - requiresWritePathForMaterializedView > ... > I guess passing the purpose will make the streaming protocol more robust for > future changes and leaves decisions up to the receiver. > But an additional "requiresFlush" would also avoid putting too much logic > into the streaming code. The streaming code should not care about purposes, > the caller or receiver should. So the decision if a stream requires a flush > before streaming should be up to the stream requester and the stream request > receiver depending on the purpose of the stream. > I'm excited about your feedback :) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
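The purpose-vs-flags trade-off discussed in the ticket could be sketched as an enum that names the purpose but also carries the derived flags, so callers get both views. This is a hypothetical illustration only: the enum name and the per-purpose flag values are assumptions (e.g. that repair streams need no flush because validation compaction already flushed, and that consistent range movements can skip the MV write path), not Cassandra's actual streaming code.

```java
// Hypothetical stream-purpose enum following the list in the ticket.
// Each purpose carries the two proposed flags, so the streaming code itself
// stays purpose-agnostic while callers/receivers decide behaviour.
public enum StreamPurpose
{
    BOOTSTRAP       (true,  false), // consistent range movement: direct SSTable stream assumed safe
    DECOMMISSION    (true,  false), // same assumption as bootstrap (requires CASSANDRA-13064)
    REPAIR          (false, true),  // validation compaction already flushed; MVs need the write path
    REPLACE_NODE    (true,  true),  // conservative illustrative defaults
    REMOVE_NODE     (true,  true),
    RANGE_RELOCATION(true,  true);

    public final boolean requiresFlush;
    public final boolean requiresMvWritePath;

    StreamPurpose(boolean requiresFlush, boolean requiresMvWritePath)
    {
        this.requiresFlush = requiresFlush;
        this.requiresMvWritePath = requiresMvWritePath;
    }
}
```

Encoding the purpose (rather than only the flags) keeps the wire format robust to future changes, as the ticket argues: the receiver can derive new behaviour from an existing purpose without a protocol change.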
[jira] [Updated] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream
[ https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13064: -- Fix Version/s: 4.0 Status: Patch Available (was: Open) https://github.com/Jaumo/cassandra/commit/4189c949336f3c7e4ba25da80fdd7da5faa2ea65 > Add stream type or purpose to stream plan / stream > -- > > Key: CASSANDRA-13064 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13064 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth > Fix For: 4.0 > > > It would be very good to know the type or purpose of a certain stream on the > receiver side. It should be both available in a stream request and a stream > task. > Why? > It would be helpful to distinguish the purpose to allow different handling of > streams and requests. Examples: > - In a stream request a global flush is done. This is not necessary for all > types of streams. A repair stream(-plan) does not require a flush as this has > been done shortly before in validation compaction and only the sstables that > have been validated also have to be streamed. > - In StreamReceiveTask streams for MVs go through the regular write path; this > is painfully slow especially on bootstrap and decommission. Both for bootstrap > and decommission this is not necessary. SSTables can be directly streamed > down in this case. Handling bootstrap is no problem as it relies on a local > state but during decommission, the decom-state is bound to the sender and not > the receiver, so the receiver has to know that it is safe to stream that > sstable directly, not through the write-path. That's why we have to know the > purpose of the stream. > I'd love to implement this on my own but I am not sure how not to break the > streaming protocol for backwards compat or if it is ok to do so. > Furthermore I'd love to get some feedback on that idea and some proposals > what stream types to distinguish. 
I could imagine: > - bootstrap > - decommission > - repair > - replace node > - remove node > - range relocation > Comments like this support my idea; knowing the purpose could avoid this: > {quote} > // TODO each call to transferRanges re-flushes, this is > potentially a lot of waste > streamPlan.transferRanges(newEndpoint, preferred, > keyspaceName, ranges); > {quote} > An alternative to passing the purpose of the stream would be to pass flags like: > - requiresFlush > - requiresWritePathForMaterializedView > ... > I guess passing the purpose will make the streaming protocol more robust for > future changes and leaves decisions up to the receiver. > But an additional "requiresFlush" would also avoid putting too much logic > into the streaming code. The streaming code should not care about purposes, > the caller or receiver should. So the decision if a stream requires a flush > before streaming should be up to the stream requester and the stream request > receiver depending on the purpose of the stream. > I'm excited about your feedback :) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889653#comment-15889653 ] Benjamin Roth commented on CASSANDRA-13241: --- I thought of 2 arrays because the semantic meaning (position vs chunk size) and a single alignment (8, 3, 2 bytes) for each could be easier to understand and to maintain. Of course it works either way. With 2 arrays, you could still "pull sections", it's just a single fetch more to get the 8-byte absolute offset. Loop summing vs. "relative-absolute offset": In the end this is always a tradeoff between memory and CPU. I personally am not the one who fights for every single byte in this case. But I also think a few more CPU cycles to sum a bunch of ints is still bearable. I guess if I had to decide, I'd give "loop summing" a try. Any different opinions? Do you mean a ChunkCache cache miss? Sorry for this kind of question. I never came across this part of the code. > Lower default chunk_length_in_kb from 64kb to 4kb > - > > Key: CASSANDRA-13241 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13241 > Project: Cassandra > Issue Type: Wish > Components: Core >Reporter: Benjamin Roth > > Having a too low chunk size may result in some wasted disk space. A too high > chunk size may lead to massive overreads and may have a critical impact on > overall system performance. > In my case, the default chunk size led to peak read IOs of up to 1GB/s and > avg reads of 200MB/s. After lowering the chunk size (of course aligned with read > ahead), the avg read IO went below 20 MB/s, rather 10-15MB/s. > The risk of (physical) overreads is increasing with lower (page cache size) / > (total data size) ratio. > High chunk sizes are mostly appropriate for bigger payloads per request but > if the model consists mostly of small rows or small resultsets, the read > overhead with 64kb chunk size is insanely high. This applies for example for > (small) skinny rows. 
> Please also see here: > https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY > To give you some insight into what a difference it can make (460GB data, 128GB > RAM): > - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L > - Disk throughput: https://cl.ly/2a0Z250S1M3c > - This shows that the request distribution remained the same, so no "dynamic > snitch magic": https://cl.ly/3E0t1T1z2c0J -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888911#comment-15888911 ] Benjamin Roth commented on CASSANDRA-13241: --- How about this: You create 2 chunk lookup tables. One with absolute pointers (long, 8 bytes). A second one with relative pointers or chunk sizes - 2 bytes are enough for up to 64kb chunks. You store an absolute pointer for every $x chunks (1000 in this example). So you can get the absolute offset looking up the absolute position with $idx = ($pos - ($pos % $x)) / $x. Then you iterate through the size lookup from ($pos - ($pos % $x)) to $pos - 1. A fallback can be provided for chunks >64kb. Either relative pointers are completely avoided or are increased to 3 bytes. There you go. Payload of 1 TB = 1024 * 1024 * 1024kb

CS 64 (NOW):
chunks = 1024 * 1024 * 1024kb / 64kb = 16777216 (16M)
compression = 1.99
compressed_size = 1024 * 1024 * 1024kb / 1.99 = 539568756kb
kernel_pages = 134892189
absolute_pointer_size = 8 * chunks = 134217728 (128MB)
kernel_page_size = 134892189 * 8 (1029 MB)
total_size = 1157MB

CS 4 with relative positions:
chunks = 1024 * 1024 * 1024kb / 4kb = 268435456 (256M)
compression = 1.75
compressed_size = 1024 * 1024 * 1024kb / 1.75 = 613566757kb
kernel_pages = 153391689
absolute_pointer_size = 8 * chunks / 1000 = 2147484 (2 MB)
relative_pointer_size = 2 * chunks = 536870912 (512 MB)
kernel_page_size = 153391689 * 8 = 1227133512 (1170MB)
total_size = 1684MB
increase = 45%

=> Reduces the memory overhead of going from 64kb to 4kb chunks from the initially mentioned 800% to 45% when you also take kernel structs into account, which are also of a relevant size - even more than the initially discussed "128M" for 64kb chunks. Pro: A lot less memory required. Con: Some CPU overhead. But is this really relevant compared to decompressing 4kb or even 64kb? P.S.: Kernel memory calculation is based on the 8 bytes [~aweisberg] has researched. 
Compression ratios are taken from the percona blog. > Lower default chunk_length_in_kb from 64kb to 4kb > - > > Key: CASSANDRA-13241 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13241 > Project: Cassandra > Issue Type: Wish > Components: Core >Reporter: Benjamin Roth > > Having a too low chunk size may result in some wasted disk space. A too high > chunk size may lead to massive overreads and may have a critical impact on > overall system performance. > In my case, the default chunk size led to peak read IOs of up to 1GB/s and > avg reads of 200MB/s. After lowering the chunk size (of course aligned with read > ahead), the avg read IO went below 20 MB/s, rather 10-15MB/s. > The risk of (physical) overreads increases with a lower (page cache size) / > (total data size) ratio. > High chunk sizes are mostly appropriate for bigger payloads per request, but > if the model consists rather of small rows or small result sets, the read > overhead with 64kb chunk size is insanely high. This applies for example to > (small) skinny rows. > Please also see here: > https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY > To give you some insights what a difference it can make (460GB data, 128GB > RAM): > - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L > - Disk throughput: https://cl.ly/2a0Z250S1M3c > - This shows that the request distribution remained the same, so no "dynamic > snitch magic": https://cl.ly/3E0t1T1z2c0J -- This message was sent by Atlassian JIRA (v6.3.15#6346)
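The two-table lookup scheme described in the comment above can be sketched like this. This is an illustrative Python sketch, not Cassandra code; the names SPARSE_EVERY, build_tables and chunk_offset are made up, and the fallback for chunks >64kb is omitted:

```python
# Sketch of the two-level chunk-offset table: a sparse table of absolute
# 8-byte offsets (one entry every SPARSE_EVERY chunks) plus a dense table
# of 2-byte chunk sizes. Assumes every compressed chunk fits in 2 bytes
# (i.e. chunks <= 64kb).
SPARSE_EVERY = 1000

def build_tables(chunk_sizes):
    """Return (absolute_offsets, sizes) for the given compressed chunk sizes."""
    absolute, offset = [], 0
    for i, size in enumerate(chunk_sizes):
        if i % SPARSE_EVERY == 0:
            absolute.append(offset)        # 8 bytes each, every 1000th chunk
        offset += size
    return absolute, list(chunk_sizes)     # 2 bytes each in a real encoding

def chunk_offset(absolute, sizes, pos):
    """Absolute file offset of chunk `pos`: one sparse lookup + short scan."""
    base = pos - (pos % SPARSE_EVERY)      # first chunk of the sparse bucket
    off = absolute[base // SPARSE_EVERY]
    for i in range(base, pos):             # at most SPARSE_EVERY - 1 additions
        off += sizes[i]
    return off
```

Looking up an offset costs one array read plus at most 999 small additions, which is the CPU-vs-memory trade-off the comment weighs against the cost of decompressing a chunk.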
[jira] [Assigned] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream
[ https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth reassigned CASSANDRA-13064: - Assignee: Benjamin Roth > Add stream type or purpose to stream plan / stream > -- > > Key: CASSANDRA-13064 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13064 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth > > It would be very good to know the type or purpose of a certain stream on the > receiver side. It should be available in both a stream request and a stream > task. > Why? > It would be helpful to distinguish the purpose to allow different handling of > streams and requests. Examples: > - In a stream request, a global flush is done. This is not necessary for all > types of streams. A repair stream(-plan) does not require a flush, as this has > been done shortly before in validation compaction, and only the sstables that > have been validated have to be streamed. > - In StreamReceiveTask, streams for MVs go through the regular write path; this > is painfully slow, especially on bootstrap and decommission. For both bootstrap > and decommission this is not necessary: SSTables can be directly streamed > down in this case. Handling bootstrap is no problem as it relies on a local > state, but during decommission the decom-state is bound to the sender and not > the receiver, so the receiver has to know that it is safe to stream that > sstable directly, not through the write path. That's why we have to know the > purpose of the stream. > I'd love to implement this on my own, but I am not sure how not to break the > streaming protocol for backwards compat, or if it is ok to do so. > Furthermore I'd love to get some feedback on that idea and some proposals > for what stream types to distinguish. I could imagine: > - bootstrap > - decommission > - repair > - replace node > - remove node > - range relocation > Comments like this support my idea; knowing the purpose could avoid this: > {quote} > // TODO each call to transferRanges re-flushes, this is > potentially a lot of waste > streamPlan.transferRanges(newEndpoint, preferred, > keyspaceName, ranges); > {quote} > An alternative to passing the purpose of the stream would be to pass flags like: > - requiresFlush > - requiresWritePathForMaterializedView > ... > I guess passing the purpose will make the streaming protocol more robust for > future changes and leaves decisions up to the receiver. > But an additional "requiresFlush" would also avoid putting too much logic > into the streaming code. The streaming code should not care about purposes; > the caller or receiver should. So the decision whether a stream requires a flush > before streaming should be up to the stream requester and the stream request > receiver, depending on the purpose of the stream. > I'm excited about your feedback :) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
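The purpose-vs-flags trade-off in the ticket can be sketched roughly as follows. This is an illustrative Python sketch, not Cassandra's streaming code; the enum values mirror the list above, and the two derived flags are the ones the ticket names:

```python
from enum import Enum

class StreamPurpose(Enum):
    BOOTSTRAP = "bootstrap"
    DECOMMISSION = "decommission"
    REPAIR = "repair"
    REPLACE_NODE = "replace node"
    REMOVE_NODE = "remove node"
    RANGE_RELOCATION = "range relocation"

# The receiver derives behaviour from the purpose, so the streaming layer
# itself stays free of purpose-specific logic - the design the ticket argues for.
def requires_flush(purpose):
    # Repair streams were just validated (and therefore flushed),
    # so flushing again before streaming is redundant.
    return purpose is not StreamPurpose.REPAIR

def requires_mv_write_path(purpose):
    # Bootstrap/decommission can stream SSTables down directly instead of
    # pushing every mutation through the slow materialized-view write path.
    return purpose not in (StreamPurpose.BOOTSTRAP, StreamPurpose.DECOMMISSION)
```

Passing the purpose over the wire and deriving flags like these on the receiver keeps the protocol extensible: a new purpose only needs new mappings, not new protocol fields.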
[jira] [Commented] (CASSANDRA-13279) Table default settings file
[ https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888112#comment-15888112 ] Benjamin Roth commented on CASSANDRA-13279: --- Maybe it was a bit misleading. I am not defending a new source per se. I am simply 'pro' improving the docs by adding problem/solution-centric resources in a place that can easily be found by anyone. E.g. if I google for "Cassandra performance tuning", the first match should go to an official guide. I'd love to volunteer, but first I'd like to work on MVs, which I have been deferring since the end of '16. But if there is a consensus on a possible structure and I have access to the docs, I am happy to add content whenever I feel like it. > Table default settings file > --- > > Key: CASSANDRA-13279 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13279 > Project: Cassandra > Issue Type: Wish > Components: Configuration >Reporter: Romain Hardouin >Priority: Minor > Labels: config, documentation > > Following CASSANDRA-13241 we often see that there is no one-size-fits-all > value for settings. We can't find a sweet spot for every use case. > It's true for settings in cassandra.yaml, but as [~brstgt] said for > {{chunk_length_in_kb}}: "this is somewhat hidden for the average user". > Many table settings are somewhat hidden for the average user. Some people > will think RTFM, but if a file - say tables.yaml - contains default values for > table settings, more people would pay attention to them. And of course this > file could contain useful comments and guidance. > Example with SSTable compression options: > {code} > # General comments about sstable compression > compression: > # First of all: explain what it is. We split each SSTable into chunks, > etc. > # Explain when users should lower this value (e.g. 4) or when a higher > value like 64 or 128 is recommended. > # Explain the trade-off between read latency and off-heap compression > metadata size. > chunk_length_in_kb: 16 > > # List of available compressors: LZ4Compressor, SnappyCompressor, and > DeflateCompressor > # Explain trade-offs, some specific use cases (e.g. archives), etc. > class: 'LZ4Compressor' > > # If you want to disable compression by default, uncomment the following > line > #enabled: false > {code} > So instead of hard-coded values we would end up with something like > TableConfig + TableDescriptor à la Config + DatabaseDescriptor. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
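One way the proposed tables.yaml could be wired in, sketched in Python for illustration. The file, the TABLE_DEFAULTS structure and table_options are hypothetical; in a real implementation the defaults would be parsed from the YAML file shown in the ticket:

```python
# Illustrative sketch of the proposed TableConfig idea: shipped defaults
# (which would come from a tables.yaml file) overlaid with per-table
# overrides, a la Config + DatabaseDescriptor. The parsed YAML is
# represented as a plain dict to keep the sketch dependency-free.
TABLE_DEFAULTS = {
    "compression": {
        "chunk_length_in_kb": 16,
        "class": "LZ4Compressor",
    },
}

def table_options(overrides=None):
    """Effective per-table options: defaults merged with user overrides."""
    opts = {section: dict(values) for section, values in TABLE_DEFAULTS.items()}
    for section, values in (overrides or {}).items():
        opts.setdefault(section, {}).update(values)
    return opts
```

The point of the file-based approach is exactly this merge step: users who never open tables.yaml get commented, documented defaults, while a per-table `WITH compression = {...}` clause only overrides the keys it names.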
[jira] [Assigned] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth reassigned CASSANDRA-13065: - Assignee: Benjamin Roth > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Assignee: Benjamin Roth >Priority: Critical > > Bootstrapping or decommissioning nodes with MVs is unbearably slow, as all streams go > through the regular write paths. This causes read-before-writes for every > mutation, and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on our own cluster. > The bootstrap case is super easy to handle; the decommission case requires > CASSANDRA-13064 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13279) Table default settings file
[ https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888084#comment-15888084 ] Benjamin Roth commented on CASSANDRA-13279: --- I can understand your consideration about the deployment issues of centralized settings in a non-centralized settings file. But I have to contradict on the second point. By "somewhat hidden" I don't mean that it does not exist, but that an average user won't come across the documentation or the valuable information related to it (why should I tweak that?). It is very difficult to find the right resource / doc in the Cassandra ecosystem. There is Datastax, there is the official Cassandra site (which contains a lot of TODOs and empty pages), wiki.apache.org (looks very outdated), and there are zillions of distributed and spread resources like blogs all over the net. Finding the right information (as a new user) is the famous needle in the haystack. You are a user / developer from the early ages and know every corner of the Cassandra universe, but for new users it is hardly navigable and 'somewhat hidden'. To be honest: When I first installed and tested Cassandra, I was totally lost. I had to test a lot, read many many many different resources, go through the hell of trial and error, analyzing, debugging, compiling and testing again with a lot of pain to get the knowledge I have today. Tweaking chunk_size was quite the same. I tried a lot of stuff, posted on lists, ... and after some days I was like "Wait, there was this setting in DevCenter with that 'chunk_size', what does it exactly do and what happens if ... AH, it works!". How about creating a structure in the official Cassandra docs with use cases and Q&A for performance tuning? Something like a structured version of Al Tobey's tuning guide with a Problem > Solution section. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily
[ https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887961#comment-15887961 ] Benjamin Roth edited comment on CASSANDRA-13226 at 2/28/17 1:12 PM: Sorry for that many comments, just another thought: Flushes can be optimized very easily in that a flush is only executed if the memtable contains mutations for the requested range OR if the memtable exceeds a certain size, so that the check is still cheap. I implemented this just for fun some months ago but never created a ticket for it. See patch here https://github.com/Jaumo/cassandra/commit/983514b0d3e15cea042533273ead5ea33c00bacf Just saw it also disabled the pre-repair flush as proposed before. was (Author: brstgt): Sorry for that many comments, just another thought: Flushes can be optimized very easily in that a flush is only executed if the memtable contains mutations for the requested range OR if the memtable exceeds a certain size, so that the check is still cheap. I implemented this just for fun some months ago but never created a ticket for it. > StreamPlan for incremental repairs flushing memtables unnecessarily > --- > > Key: CASSANDRA-13226 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13226 > Project: Cassandra > Issue Type: Bug >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Minor > Fix For: 4.0 > > > Since incremental repairs are run against a fixed dataset, there's no need to > flush memtables when streaming for them. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
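The cheap pre-flush check described in the comment (flush only when the memtable actually intersects the requested ranges, or is too big to check cheaply) might look roughly like this. This is a sketch, not the linked patch; ranges are modelled as simple half-open token intervals, and the threshold value is made up:

```python
# Sketch: skip the pre-stream flush when the memtable holds nothing for
# the requested ranges and is small enough that scanning it stays cheap.
def ranges_intersect(a, b):
    """Half-open intervals (start, end) overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def should_flush(memtable_ranges, requested_ranges, memtable_bytes,
                 size_threshold=64 * 1024 * 1024):
    if memtable_bytes > size_threshold:
        return True  # too big to scan cheaply - just flush
    return any(ranges_intersect(m, r)
               for m in memtable_ranges
               for r in requested_ranges)
```

The size guard is what keeps the optimization safe: the intersection scan is only attempted when it is guaranteed to be cheaper than the flush it might save.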
[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily
[ https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887961#comment-15887961 ] Benjamin Roth commented on CASSANDRA-13226: --- Sorry for that many comments, just another thought: Flushes can be optimized very easily in that a flush is only executed if the memtable contains mutations for the requested range OR if the memtable exceeds a certain size, so that the check is still cheap. I implemented this just for fun some months ago but never created a ticket for it. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily
[ https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887945#comment-15887945 ] Benjamin Roth commented on CASSANDRA-13226: --- I am referring to this "stacktrace": RepairMessageVerbHandler.doVerb (case VALIDATION_REQUEST) -> CompactionManager.instance.submitValidation(store, validator) -> CompactionManager.doValidationCompaction -> StorageService.instance.forceKeyspaceFlush. After that, merkle trees are calculated and, based on them, streams are triggered. That's why all data that is eligible for transfer has already been flushed. Also, avoiding a flush locally is only half the way: streams REQUESTED by a stream plan also cause a flush on the sender side. But that sender has also already validated (and therefore flushed) the requested data. Maybe I missed something, but from what I can see, a REPAIR stream never requires a flush. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13279) Table default settings file
[ https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887733#comment-15887733 ] Benjamin Roth commented on CASSANDRA-13279: --- Great idea! +1 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily
[ https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887654#comment-15887654 ] Benjamin Roth commented on CASSANDRA-13226: --- Isn't this also true for non-incremental repairs? Merkle tree calculation also triggers a flush, and any repair begins with a merkle tree. So there is no need to flush, as the inconsistent dataset to be streamed for repair is always contained in SSTables flushed by the MT calculation before. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887602#comment-15887602 ] Benjamin Roth edited comment on CASSANDRA-13241 at 2/28/17 8:58 AM: Just thinking about Jeffs + Bens comments: Even if you have 4 TB of data and 32GB RAM, 4KB might help. In that (extreme) case, you'd steal ~8GB from page cache for "chunk tables". These 8GB probably would have helped a fraction of nothing when used as page cache if you look at the RAM/load ratio. Most probably the PC would be totally ineffective unless you have a very, very low percentage of hot data. So the probability that nearly every read results in a physical IO is very high. So in that case, lowering the chunk size to 4KB would at least save you from immense overread and help the SSDs to survive that situation. That said, I see only one REAL problem: if you have more chunk-offset data than fits in your memory. But in that case my answer would simply be: Get more RAM. There are certain minimum requirements you MUST fulfill. The idea of running a node with many TBs of data on less than, say, 16-32GB is simply insane from any perspective. Nevertheless, optimizing the memory usage of the chunk-offset lookup would be worthwhile either way. was (Author: brstgt): Just thinking about Jeffs + Bens comments: Even if you have 4 TB of data and 32GB RAM, 4KB might help. In that (extreme) case, you'd steal ~8GB from page cache for "chunk tables". These 8GB probably would have helped a fraction of nothing when used as page cache if you look at the RAM/load ratio. Most probably the PC would be totally ineffective unless you have a very, very low percentage of hot data. So the probability that nearly every read results in a physical IO is very high. So in that case, lowering the chunk size to 4KB would at least save you from immense overread and help the SSDs to survive that situation. That said, I see only one REAL problem: if you have more chunk-offset data than fits in your memory. But in that case my answer would simply be: Get more RAM. There are certain minimum requirements you MUST fulfill. The idea of running a node with many TBs of data on less than, say, 16-32GB is simply insane from all kinds of perspectives. Nevertheless, optimizing the memory usage of the chunk-offset lookup would be worthwhile either way. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887602#comment-15887602 ] Benjamin Roth commented on CASSANDRA-13241: --- Just thinking about Jeffs + Bens comments: Even if you have 4 TB of data and 32GB RAM, 4KB might help. In that (extreme) case, you'd steal ~8GB from page cache for "chunk tables". These 8GB probably would have helped a fraction of nothing when used as page cache if you look at the RAM/load ratio. Most probably the PC would be totally ineffective unless you have a very, very low percentage of hot data. So the probability that nearly every read results in a physical IO is very high. So in that case, lowering the chunk size to 4KB would at least save you from immense overread and help the SSDs to survive that situation. That said, I see only one REAL problem: if you have more chunk-offset data than fits in your memory. But in that case my answer would simply be: Get more RAM. There are certain minimum requirements you MUST fulfill. The idea of running a node with many TBs of data on less than, say, 16-32GB is simply insane from all kinds of perspectives. Nevertheless, optimizing the memory usage of the chunk-offset lookup would be worthwhile either way. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887451#comment-15887451 ] Benjamin Roth commented on CASSANDRA-13241: --- [~aweisberg] I didn't really get the point of your comment. Would you like to explain? [~jjirsa] I understand your consideration. A default value should avoid worst cases for most or all people and not optimize one case. So maybe yes, we could choose something in between. Do you see a way to offer a recommendation to users, similar to the comments in cassandra.yaml? IMHO this table option is somewhat hidden for the average user but may have a huge impact on your overall server load and your latency. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13241: -- Hm. I read recommendations that a single node should not have a load of more than 1-2 TB per node. And I read recommendations of having at least 128GB RAM. If I pay 2GB for a recommended max load to get MUCH better performance on uncached IO, which makes up more than 80% of recommended sizings (on equally hot data), that seems a quite fair price to me. If there is much less hot data, it probably still works, as you only trade page cache for faster IO. The less hot data, the less page cache is required. Did I miss a point? Btw, 4kb worked perfectly for me with 460GB load / 128GB RAM. 64kb did not work well. Really. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886536#comment-15886536 ] Benjamin Roth commented on CASSANDRA-13241: --- According to percona (https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/) and my own experience, the impact on compression ratio is not that big with lz4. Can the increased offheap requirements be expressed in a formula? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
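A back-of-the-envelope answer to the formula question above, under the assumption (discussed in this thread) of one 8-byte offset per compressed chunk. This is my own sketch, not an official Cassandra formula, and it ignores the sparse/relative-pointer optimization proposed later:

```python
# Off-heap chunk-offset overhead: roughly one 8-byte long per chunk, i.e.
# overhead_bytes ~= data_size / chunk_size * pointer_size.
def offheap_overhead_bytes(data_bytes, chunk_length_in_kb, pointer_bytes=8):
    chunks = data_bytes // (chunk_length_in_kb * 1024)
    return chunks * pointer_bytes

GB = 1024 ** 3
TB = 1024 ** 4
```

Plugging in the numbers from this thread: 4TB of data at 4kb chunks gives the ~8GB of "chunk tables" mentioned above, 1TB at 4kb gives ~2GB, and 1TB at the current 64kb default gives the 128MB figure discussed earlier.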
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886517#comment-15886517 ] Benjamin Roth commented on CASSANDRA-13241: --- No worries. Your patch answered my questions implicitly. Thanks!
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886402#comment-15886402 ] Benjamin Roth commented on CASSANDRA-13241: --- Thanks for your vote, but ... maybe this is a stupid question: Who will finally decide if that change is accepted? I think I could make a patch pretty easily, but how does change management work?
[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878185#comment-15878185 ] Benjamin Roth commented on CASSANDRA-13241: --- Thanks for your comment. Of course there is no perfect match for all cases. IMHO the default value should avoid the worst negative impacts for most or all cases rather than bring great results for a few use cases. I personally use 4KB with >450GB of data on a 128GB (12GB JVM heap) machine and the situation improved A LOT. We also have tables with >10M partitions and I haven't seen any problems so far. If someone has a better proposal and maybe an explanation, why not.
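The overread argument in the ticket description can be made concrete with a tiny calculation. This is an illustrative estimate, not a measurement: it assumes a cold page cache, so serving one small row means decompressing one whole chunk.

```python
def read_amplification(row_bytes: int, chunk_length_kb: int) -> float:
    """Bytes fetched from disk per byte actually requested, assuming a
    single small row is served by reading one whole compressed chunk."""
    return (chunk_length_kb * 1024) / row_bytes

# A 200-byte skinny row under different chunk lengths:
for kb in (64, 16, 4):
    print(kb, read_amplification(200, kb))
```

For a 200-byte row, a 64 KB chunk implies roughly 328x read amplification, versus about 20x at 4 KB, which matches the order-of-magnitude IO drop reported in the ticket.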
[jira] [Created] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb
Benjamin Roth created CASSANDRA-13241: - Summary: Lower default chunk_length_in_kb from 64kb to 4kb Key: CASSANDRA-13241 URL: https://issues.apache.org/jira/browse/CASSANDRA-13241 Project: Cassandra Issue Type: Wish Components: Core Reporter: Benjamin Roth Having a too low chunk size may result in some wasted disk space. A too high chunk size may lead to massive overreads and may have a critical impact on overall system performance. In my case, the default chunk size led to peak read IOs of up to 1 GB/s and avg reads of 200 MB/s. After lowering the chunk size (aligned with read ahead, of course), the avg read IO went below 20 MB/s, more like 10-15 MB/s. The risk of (physical) overreads increases with a lower (page cache size) / (total data size) ratio. High chunk sizes are mostly appropriate for bigger payloads per request, but if the model consists mostly of small rows or small result sets, the read overhead with a 64kb chunk size is insanely high. This applies, for example, to (small) skinny rows. Please also see here: https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY To give you some insight into what a difference it can make (460GB data, 128GB RAM): - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L - Disk throughput: https://cl.ly/2a0Z250S1M3c - This shows that the request distribution remained the same, so no "dynamic snitch magic": https://cl.ly/3E0t1T1z2c0J
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821204#comment-15821204 ] Benjamin Roth commented on CASSANDRA-12888: --- Hi Victor, 1. Performance: Performance can be better with MVs than with batches, but this depends on the read performance of the base table vs. the overhead of batches, which in turn depends on the batch size and the batchlog performance. An MV always incurs a read before write, so MV performance depends largely on that read. The final write operation of the MV update is fast, as it works like a regular (local) write. 2. Partition keys and remote MV updates: You are of course right that this may be a common use case. You have to use it carefully. Maybe the situation has already improved through some bugfixes; the last time I tried was some months ago. To be fair, I have to mention that back then there was a bug with a race condition that could deadlock the whole mutation stage. With "remote MVs" we ran into this situation very frequently during bootstraps (for example). This has to do with MV locks and the much longer lock time when the MV update is remote, leading to more lock contention. With remote MV updates, the current write request also depends on the performance of remote nodes. This can lead to write timeouts much faster, as long as the (remote) MV update is part of the write request and not deferred. So again: maybe this situation has improved meanwhile, but I personally didn't require it, so I was able to use normal tables to "twist" the PK. We currently use MVs only to add a field to the primary key for sorting.
> Incremental repairs broken for MVs and CDC > -- > > Key: CASSANDRA-12888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12888 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Stefan Podkowinski >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.0.x, 3.x > > > SSTables streamed during the repair process will first be written locally and > afterwards either simply added to the pool of existing sstables or, in case > of existing MVs or active CDC, replayed on a per-mutation basis: > As described in {{StreamReceiveTask.OnCompletionRunnable}}: > {quote} > We have a special path for views and for CDC. > For views, since the view requires cleaning up any pre-existing state, we > must put all partitions through the same write path as normal mutations. This > also ensures any 2is are also updated. > For CDC-enabled tables, we want to ensure that the mutations are run through > the CommitLog so they can be archived by the CDC process on discard. > {quote} > Using the regular write path turns out to be an issue for incremental > repairs, as we lose the {{repaired_at}} state in the process. Eventually the > streamed rows will end up in the unrepaired set, in contrast to the rows on > the sender side moved to the repaired set. The next repair run will stream > the same data back again, causing rows to bounce on and on between nodes on > each repair. > See linked dtest on steps to reproduce. An example for reproducing this > manually using ccm can be found > [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818955#comment-15818955 ] Benjamin Roth commented on CASSANDRA-12888: --- Hi Victor, We use MVs in production with billions of records without known data loss. Painful + slow refers to repairs and range movements (e.g. bootstrap + decommission). Also (as mentioned in this ticket) incremental repairs don't work, so full repair creates some overhead. Until 3.10 there are bugs leading to write timeouts, even to NPEs and completely blocked mutation stages. This could even bring your cluster down. In 3.10 some issues have been resolved - actually we use a patched trunk version which is 1-2 months old. Depending on your model, MVs can help a lot from a developer perspective. Some cases are very resource-intensive to manage without MVs, requiring distributed locks and/or CAS. For append-only workloads, it may be simpler NOT to use MVs at the moment. They aren't very complex, and MVs won't help that much compared to the problems that may arise with them. Painful scenarios: There is no recipe for that. You may or may not encounter performance issues, depending on your model and your workload. I'd recommend not using MVs with a different partition key on the MV than on the base table, as this requires inter-node communication for EVERY write operation. So you can easily kill your cluster with bulk operations (as in streaming). At the moment our cluster runs stable, but it took months to find all the bottlenecks, race conditions, ways to resume from failures and so on. So my recommendation: you can get it to work, but you need time, and you should not start with critical data, at least if it is not backed by another stable storage. And you should use 3.10 when it is finally released, or build your own version from trunk. I would not recommend using < 3.10 for MVs.
Btw.: Our own patched version does some dirty tricks that may lead to inconsistencies in some situations, but we prefer some possible inconsistencies (which we can deal with) over performance bottlenecks. I created several tickets to improve MV performance in some streaming situations, but it will take some time to really improve that situation. Does this answer your question? -- Benjamin Roth Prokurist Jaumo GmbH · www.jaumo.com Wehrstraße 46 · 73035 Göppingen · Germany Phone +49 7161 304880-6 · Fax +49 7161 304880-1 AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818766#comment-15818766 ] Benjamin Roth commented on CASSANDRA-12888: --- It depends ;) There are known issues, mostly related to repair and streaming. MVs basically work and do what you expect of them, but maintenance jobs may be slow and/or painful. So the good old saying is true: you can use them if you understand them and know what you are doing. But don't expect them to be plug and play.
[jira] [Created] (CASSANDRA-13073) Optimize repair behaviour with MVs
Benjamin Roth created CASSANDRA-13073: - Summary: Optimize repair behaviour with MVs Key: CASSANDRA-13073 URL: https://issues.apache.org/jira/browse/CASSANDRA-13073 Project: Cassandra Issue Type: Bug Reporter: Benjamin Roth I am referring to a discussion on the dev list about the MV streaming issues discussed in 12888. It turned out that under some circumstances, repairing MVs can lead to inconsistencies. To remove those inconsistencies, it is necessary to repair the base table first and then the MV again. These inconsistencies can be created both by read repair and by CF/keyspace repair. Proposition: - Exclude MVs from keyspace repairs - Disable read repairs on MVs or transform them into a read repair of the base table (maybe complicated but possible) Explanation: - CF base has PK a and field b - MV has PK a, b - 2 nodes n1 + n2, no hints - Initial state is a=1,b=1 at time t=0 - Node n2 goes down - Mutation a=1,b=2 at time t=1 - Node n2 comes up and node n1 goes down - Mutation a=1,b=3 at time t=2 - Node n1.mv contains: a=1,b=3 + tombstone for a=1,b=1 - Node n2.mv contains: a=1,b=2 + tombstone for a=1,b=1 When doing a repair on the MV _before_ repairing the base, the MV would look like: - Node n1.mv contains: a=1,b=3 + tombstone for a=1,b=1 + a=1,b=2 - Node n2.mv contains: a=1,b=2 + tombstone for a=1,b=1 + a=1,b=3 Repairing _only_ the base table would create the correct result: - Node n1.mv contains: a=1,b=3 + tombstone for a=1,b=1 + tombstone for a=1,b=2 - Node n2.mv contains: a=1,b=3 + tombstone for a=1,b=1 (the tombstone for a=1,b=2 should not have been created, as b=3 was already there, which shadows b=2, so b=2 should not reach the MV at all) None of this applies if CASSANDRA-13066 is implemented and enabled
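The repair-ordering example above can be replayed with a toy model. This is a deliberately simplified sketch, not Cassandra's merge logic: MV rows are keyed by the MV primary key (a, b), repair is modeled as a plain union of replica states, and tombstones are just labels.

```python
# Toy replica states after the partition scenario in the ticket:
# n1 saw the b=3 mutation, n2 saw the b=2 mutation.
n1_mv = {(1, 3): "live", (1, 1): "tombstone"}
n2_mv = {(1, 2): "live", (1, 1): "tombstone"}

# Repairing the MV directly merges both replicas. Since (1, 2) and
# (1, 3) are *different* MV rows, both stay live - the inconsistency.
mv_first = {**n1_mv, **n2_mv}
print(sorted(k for k, v in mv_first.items() if v == "live"))

# Repairing the base table first converges it on a=1, b=3 (the newer
# timestamp wins on the single base row), and the MV derived from that
# base contains only one live row.
base = {1: (3, 2)}  # a -> (b, timestamp)
mv_from_base = {(a, b): "live" for a, (b, _) in base.items()}
print(sorted(mv_from_base))
```

The union-based merge leaves both b=2 and b=3 live in the MV, while deriving the MV from the repaired base yields only b=3, matching the ticket's argument for repairing the base table first.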
[jira] [Created] (CASSANDRA-13066) Fast repair with materialized views
Benjamin Roth created CASSANDRA-13066: - Summary: Fast repair with materialized views Key: CASSANDRA-13066 URL: https://issues.apache.org/jira/browse/CASSANDRA-13066 Project: Cassandra Issue Type: Improvement Reporter: Benjamin Roth I propose adding a configuration option to send streams of tables with MVs not through the regular write path. This may be either a global option or, better, a CF option. Background: A repair of a CF with an MV that is much out of sync creates many streams. These streams all go through the regular write path to assert local consistency of the MV. This in turn causes a read before write for every single mutation, which puts a lot of pressure on the node - much more than simply streaming the SSTable down. In some cases this can be avoided. Instead of only repairing the base table, all base + MV tables would have to be repaired. But this can break eventual consistency between the base table and the MV. The proposed behaviour is always safe when having append-only MVs. It also works when using CL_QUORUM writes, but it cannot be absolutely guaranteed that a quorum write is applied atomically, so this can also lead to inconsistencies if a quorum write is started but one node dies in the middle of the request. So, this proposal can help a lot in some situations but can also break consistency in others. That's why it should be left up to the operator whether that behaviour is appropriate for individual use cases. This issue came up here: https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766478#comment-15766478 ] Benjamin Roth commented on CASSANDRA-13065: --- I marked it as critical as it severely affects cluster maintainability when using MVs. Maybe it's also worth considering as a bug. > Consistent range movements to not require MV updates to go through write > paths > --- > > Key: CASSANDRA-13065 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Priority: Critical > > Bootstrapping or decommissioning nodes with MVs is unbearably slow, as all streams go > through the regular write paths. This causes read-before-writes for every > mutation, and during bootstrap it causes them to be sent to the batchlog. > This makes it virtually impossible to boot a new node in an acceptable amount > of time. > Using the regular streaming behaviour for consistent range movements works > much better in this case and does not break the MV local consistency contract. > Already tested on our own cluster. > The bootstrap case is super easy to handle; the decommission case requires > CASSANDRA-13064
[jira] [Updated] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13065: -- Priority: Critical (was: Major)
[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
[ https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766474#comment-15766474 ] Benjamin Roth commented on CASSANDRA-13065: --- Required for decommissioning
[jira] [Created] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths
Benjamin Roth created CASSANDRA-13065: - Summary: Consistent range movements to not require MV updates to go through write paths Key: CASSANDRA-13065 URL: https://issues.apache.org/jira/browse/CASSANDRA-13065 Project: Cassandra Issue Type: Improvement Reporter: Benjamin Roth Bootstrapping or decommissioning nodes with MVs is unbearably slow, as all streams go through the regular write paths. This causes read-before-writes for every mutation, and during bootstrap it causes them to be sent to the batchlog. This makes it virtually impossible to boot a new node in an acceptable amount of time. Using the regular streaming behaviour for consistent range movements works much better in this case and does not break the MV local consistency contract. Already tested on our own cluster. The bootstrap case is super easy to handle; the decommission case requires CASSANDRA-13064
[jira] [Created] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream
Benjamin Roth created CASSANDRA-13064: - Summary: Add stream type or purpose to stream plan / stream Key: CASSANDRA-13064 URL: https://issues.apache.org/jira/browse/CASSANDRA-13064 Project: Cassandra Issue Type: Improvement Reporter: Benjamin Roth It would be very good to know the type or purpose of a certain stream on the receiver side. It should be available both in a stream request and in a stream task. Why? It would be helpful to distinguish the purpose to allow different handling of streams and requests. Examples: - In a stream request a global flush is done. This is not necessary for all types of streams. A repair stream(-plan) does not require a flush, as this has been done shortly before in validation compaction, and only the sstables that have been validated have to be streamed. - In StreamReceiveTask, streams for MVs go through the regular write path; this is painfully slow, especially on bootstrap and decommission. For both bootstrap and decommission this is not necessary. Sstables can be directly streamed down in this case. Handling bootstrap is no problem, as it relies on a local state, but during decommission the decom-state is bound to the sender and not the receiver, so the receiver has to know that it is safe to stream that sstable directly, not through the write path. That's why we have to know the purpose of the stream. I'd love to implement this on my own, but I am not sure how not to break the streaming protocol for backwards compat, or if it is ok to do so. Furthermore I'd love to get some feedback on that idea and some proposals on what stream types to distinguish. I could imagine: - bootstrap - decommission - repair - replace node - remove node - range relocation Comments like this support my idea; knowing the purpose could avoid this.
{quote} // TODO each call to transferRanges re-flushes, this is potentially a lot of waste streamPlan.transferRanges(newEndpoint, preferred, keyspaceName, ranges); {quote} An alternative to passing the purpose of the stream would be to pass flags like: - requiresFlush - requiresWritePathForMaterializedView ... I guess passing the purpose will make the streaming protocol more robust for future changes and leaves decisions up to the receiver. But an additional "requiresFlush" would also avoid putting too much logic into the streaming code. The streaming code should not care about purposes; the caller or receiver should. So the decision whether a stream requires a flush before streaming should be up to the stream requester and the stream request receiver, depending on the purpose of the stream. I'm excited about your feedback :)
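The proposal above (purposes plus derived flags) could be sketched roughly as follows. This is a hypothetical illustration, not an actual Cassandra API: the enum members follow the list in the ticket, and the flush/write-path mappings are one reading of the examples given there.

```python
from enum import Enum


class StreamPurpose(Enum):
    """Hypothetical stream purposes, following the list in the ticket."""
    BOOTSTRAP = "bootstrap"
    DECOMMISSION = "decommission"
    REPAIR = "repair"
    REPLACE_NODE = "replace node"
    REMOVE_NODE = "remove node"
    RANGE_RELOCATION = "range relocation"

    @property
    def requires_flush(self) -> bool:
        # Repair streams were flushed shortly before, during validation
        # compaction, so re-flushing on transfer would be wasted work.
        return self is not StreamPurpose.REPAIR

    @property
    def requires_mv_write_path(self) -> bool:
        # Per the ticket, consistent range movements (bootstrap and
        # decommission) can stream sstables down directly instead of
        # replaying them through the regular write path.
        return self not in (StreamPurpose.BOOTSTRAP, StreamPurpose.DECOMMISSION)


print(StreamPurpose.REPAIR.requires_flush)             # False
print(StreamPurpose.BOOTSTRAP.requires_mv_write_path)  # False
```

Encoding the purpose and deriving the flags at the receiver, as sketched here, keeps the "what to do about it" decision out of the streaming code itself, which is the design argument made in the ticket.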
[jira] [Updated] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream
[ https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Roth updated CASSANDRA-13064: -- Description: It would be very good to know the type or purpose of a certain stream on the receiver side. It should be available both in a stream request and in a stream task. Why? It would be helpful to distinguish the purpose to allow different handling of streams and requests. Examples: - In a stream request a global flush is done. This is not necessary for all types of streams. A repair stream(-plan) does not require a flush, as this has been done shortly before in validation compaction, and only the sstables that have been validated have to be streamed. - In StreamReceiveTask, streams for MVs go through the regular write path; this is painfully slow, especially on bootstrap and decommission. For both bootstrap and decommission this is not necessary. Sstables can be directly streamed down in this case. Handling bootstrap is no problem, as it relies on a local state, but during decommission the decom-state is bound to the sender and not the receiver, so the receiver has to know that it is safe to stream that sstable directly, not through the write path. That's why we have to know the purpose of the stream. I'd love to implement this on my own, but I am not sure how not to break the streaming protocol for backwards compat, or if it is ok to do so. Furthermore I'd love to get some feedback on that idea and some proposals on what stream types to distinguish. I could imagine: - bootstrap - decommission - repair - replace node - remove node - range relocation Comments like this support my idea; knowing the purpose could avoid this.
{quote} // TODO each call to transferRanges re-flushes, this is potentially a lot of waste streamPlan.transferRanges(newEndpoint, preferred, keyspaceName, ranges); {quote} And alternative to passing the purpose of the stream was to pass flags like: - requiresFlush - requiresWritePathForMaterializedView ... I guess passing the purpose will make the streaming protocol more robust for future changes and leaves decisions up to the receiver. But an additional "requiresFlush" would also avoid putting too much logic into the streaming code. The streaming code should not care about purposes, the caller or receiver should. So the decision if a stream requires as flush before stream should be up to the stream requester and the stream request receiver depending on the purpose of the stream. I'm excited about your feedback :) was: It would be very good to know the type or purpose of a certain stream on the receiver side. It should be both available in a stream request and a stream task. Why? It would be helpful to distinguish the purpose to allow different handling of streams and requests. Examples: - In stream request a global flush is done. This is not necessary for all types of streams. A repair stream(-plan) does not require a flush as this has been done shortly before in validation compaction and only the sstables that have been validated also have to be streamed. - In StreamReceiveTask streams for MVs go through the regular write path this is painfully slow especially on bootstrap and decomission. Both for bootstrap and decommission this is not necessary. Sstables can be directly streamed down in this case. Handling bootstrap is no problem as it relies on a local state but during decommission, the decom-state is bound to the sender and not the receiver, so the receiver has to know that it is safe to stream that sstable directly, not through the write-path. Thats why we have to know the purpose of the stream. 
I'd love to implement this on my own, but I am not sure how to avoid breaking the streaming protocol for backwards compatibility, or if it is ok to do so. Furthermore, I'd love to get some feedback on that idea and some proposals on what stream types to distinguish. I could imagine: - bootstrap - decommission - repair - replace node - remove node - range relocation Comments like the following support my idea; knowing the purpose could avoid this. {quote} // TODO each call to transferRanges re-flushes, this is potentially a lot of waste streamPlan.transferRanges(newEndpoint, preferred, keyspaceName, ranges); {quote} An alternative to passing the purpose of the stream would be to pass flags like: - requiresFlush - requiresWritePathForMaterializedView ... I guess passing the purpose will make the streaming protocol more robust for future changes and leaves decisions up to the receiver. But an additional "requiresFlush" would also avoid putting too much logic into the streaming code. The streaming code should not care about purposes; the caller or receiver should. So the decision whether a stream requires a flush before streaming should be up to the stream requester and the stream request receiver
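The proposal above (a stream purpose enum versus individual flags) can be sketched as follows. This is a purely illustrative Python sketch, not Cassandra code: the names StreamPurpose, requires_flush, and requires_mv_write_path are invented here, and which purposes get which flags is an assumption based on the examples in the description (repair needs no flush; bootstrap and decommission need no MV write path):

```python
from enum import Enum

class StreamPurpose(Enum):
    """Hypothetical stream purposes, as proposed in the ticket description."""
    BOOTSTRAP = "bootstrap"
    DECOMMISSION = "decommission"
    REPAIR = "repair"
    REPLACE_NODE = "replace node"
    REMOVE_NODE = "remove node"
    RANGE_RELOCATION = "range relocation"

# Per-purpose flags decided by the requester/receiver, NOT by the streaming
# code itself. Repair streams skip the flush because validation compaction
# flushed shortly before; bootstrap/decommission can add MV sstables directly.
_REQUIRES_FLUSH = {
    StreamPurpose.BOOTSTRAP, StreamPurpose.DECOMMISSION,
    StreamPurpose.REPLACE_NODE, StreamPurpose.REMOVE_NODE,
    StreamPurpose.RANGE_RELOCATION,
}
_REQUIRES_MV_WRITE_PATH = {
    StreamPurpose.REPAIR, StreamPurpose.RANGE_RELOCATION,
}

def requires_flush(purpose: StreamPurpose) -> bool:
    return purpose in _REQUIRES_FLUSH

def requires_mv_write_path(purpose: StreamPurpose) -> bool:
    return purpose in _REQUIRES_MV_WRITE_PATH
```

Passing only the purpose over the wire and deriving the flags on each side keeps the protocol extensible, while the derived predicates keep purpose-specific logic out of the streaming internals, which is the trade-off discussed above.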
[jira] [Comment Edited] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750958#comment-15750958 ] Benjamin Roth edited comment on CASSANDRA-12905 at 12/15/16 10:00 AM: -- +1 for the dtest! Thanks for explaining that hint-thing. For my understanding of hint retransmission: If we have a hint file that is being processed and a WTE occurs in the middle, will the whole file be retransmitted or can it be resumed at the last successful position? I guess this is not the case, judging from my personal observations. I had situations with > 1GB hint queues per sender node which were not going to shrink due to WTEs. It seemed like the same hints had been retransmitted over and over again from scratch instead of trying to resume. What helped in this situation was to pause hint delivery on all nodes but 1-2. I'm pretty sure the problem was WTEs due to lock contention. To be honest, I did not try lowering hinted_handoff_throttle_in_kb, or at least I don't remember. was (Author: brstgt): +1 for the dtest! For my understanding of hint retransmission: If we have a hint file that is being processed and a WTE occurs in the middle, will the whole file be retransmitted or can it be resumed at the last successful position? I guess this is not the case, judging from my personal observations. I had situations with > 1GB hint queues per sender node which were not going to shrink due to WTEs. It seemed like the same hints had been retransmitted over and over again from scratch instead of trying to resume. What helped in this situation was to pause hint delivery on all nodes but 1-2. I'm pretty sure the problem was WTEs due to lock contention. To be honest, I did not try lowering hinted_handoff_throttle_in_kb, or at least I don't remember. 
> Retry acquire MV lock on failure instead of throwing WTE on streaming > - > > Key: CASSANDRA-12905 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12905 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: centos 6.7 x86_64 >Reporter: Nir Zilka >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.10 > > > Hello, > I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, > private VLAN), > first it was 2.2.5.1 and repair worked flawlessly, > second upgrade was to 3.0.9 (with upgradesstables) and also repair worked > well, > then I upgraded 2 weeks ago to 3.9 - and the repair problems started. > there are several error types from the system.log (different nodes) : > - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx > - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation > timed out - received only 0 responses > - Remote peer xxx.xxx.xxx.xxx failed stream session > - Session completed with the following error > org.apache.cassandra.streaming.StreamException: Stream failed > > I use the 3.9 default configuration with the cluster settings adjustments (3 > seeds, GossipingPropertyFileSnitch). > streaming_socket_timeout_in_ms is the default (8640). > I'm afraid of consistency problems while I'm not performing repairs. > Any ideas? > Thanks, > Nir. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750958#comment-15750958 ] Benjamin Roth commented on CASSANDRA-12905: --- +1 for the dtest! For my understanding of hint retransmission: If we have a hint file that is being processed and a WTE occurs in the middle, will the whole file be retransmitted or can it be resumed at the last successful position? I guess this is not the case, judging from my personal observations. I had situations with > 1GB hint queues per sender node which were not going to shrink due to WTEs. It seemed like the same hints had been retransmitted over and over again from scratch instead of trying to resume. What helped in this situation was to pause hint delivery on all nodes but 1-2. I'm pretty sure the problem was WTEs due to lock contention. To be honest, I did not try lowering hinted_handoff_throttle_in_kb, or at least I don't remember. > Retry acquire MV lock on failure instead of throwing WTE on streaming > - > > Key: CASSANDRA-12905 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12905 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: centos 6.7 x86_64 >Reporter: Nir Zilka >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.10 > > > Hello, > I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, > private VLAN), > first it was 2.2.5.1 and repair worked flawlessly, > second upgrade was to 3.0.9 (with upgradesstables) and also repair worked > well, > then I upgraded 2 weeks ago to 3.9 - and the repair problems started. 
> there are several error types from the system.log (different nodes) : > - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx > - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation > timed out - received only 0 responses > - Remote peer xxx.xxx.xxx.xxx failed stream session > - Session completed with the following error > org.apache.cassandra.streaming.StreamException: Stream failed > > I use the 3.9 default configuration with the cluster settings adjustments (3 > seeds, GossipingPropertyFileSnitch). > streaming_socket_timeout_in_ms is the default (8640). > I'm afraid of consistency problems while I'm not performing repairs. > Any ideas? > Thanks, > Nir. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
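The ticket title proposes retrying the MV lock acquisition instead of failing the stream with a WriteTimeoutException on the first contended attempt. A minimal sketch of that retry loop, in Python purely for illustration (the names acquire_mv_lock_with_retry, try_lock, and the backoff parameters are all invented here, not Cassandra's actual API):

```python
import time

class WriteTimeoutException(Exception):
    """Stand-in for Cassandra's WTE, raised only after the retry budget is spent."""
    pass

def acquire_mv_lock_with_retry(try_lock, timeout_s=2.0, base_backoff_s=0.001):
    """Retry a non-blocking lock attempt with exponential backoff.

    try_lock: callable returning True if the MV partition lock was acquired.
    Instead of throwing a WTE on the first failed attempt, keep retrying
    until an overall deadline, so transient lock contention (e.g. during
    bulk mutation applies from a stream) does not fail the whole session.
    """
    deadline = time.monotonic() + timeout_s
    backoff = base_backoff_s
    while True:
        if try_lock():
            return True
        if time.monotonic() >= deadline:
            raise WriteTimeoutException("could not acquire MV lock before deadline")
        time.sleep(backoff)
        backoff = min(backoff * 2, 0.1)  # cap the backoff at 100ms
```

The design choice this illustrates: the deadline bounds total waiting (so a truly stuck lock still surfaces as a timeout), while the backoff avoids hammering a contended lock, which is exactly the trade-off between "retry on failure" and "throw WTE" discussed in this ticket.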
[jira] [Comment Edited] (CASSANDRA-12991) Inter-node race condition in validation compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750908#comment-15750908 ] Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:43 AM: - I created a little script to calculate some possible scenarios: https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99 Output: {quote} * Calculates the likeliness of a race condition leading to unnecessary repairs * @see https://issues.apache.org/jira/browse/CASSANDRA-12991 * * This assumes that all writes are equally spread over all token ranges * and there is one subrange repair executed for each existing token range 3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 req/s Total Ranges: 768 Likeliness for RC per range: 0.39% Unnecessary range repairs per repair: 3.00 3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 req/s Total Ranges: 768 Likeliness for RC per range: 1.56% Unnecessary range repairs per repair: 12.00 8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 req/s Total Ranges: 2048 Likeliness for RC per range: 2.93% Unnecessary range repairs per repair: 60.00 8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 req/s Total Ranges: 2048 Likeliness for RC per range: 5.37% Unnecessary range repairs per repair: 110.00 {quote} You may ask why I entered latencies like 10ms or 20ms - this seems quite high. It is indeed quite high for regular tables and a cluster that is not overloaded. Under those conditions the latency is dominated by your network latency, so 1ms seems quite fair to me. As soon as you use MVs and your cluster tends to overload, higher latencies are not unrealistic. You have to take into account that an MV operation does a read before write, so the latency may vary very much. For MVs the latency is no longer dominated (only) by network latency but by MV lock acquisition and read before write. 
Both factors can introduce MUCH higher latencies, depending on concurrent operations on the MV, number of SSTables, compaction strategy - just everything that affects read performance. If your cluster is overloaded, these effects have an even higher impact. I observed MANY situations on our production system where writes timed out during streaming because of lock contention and/or RBW impacts. These situations mainly pop up during repair sessions when streams cause bulk mutation applies (see the StreamReceiveTask path for MVs). The impact is even higher due to CASSANDRA-12888. Parallel repairs, like e.g. reaper does, make the situation even more unpredictable and increase "drifts" between nodes, like Node A is overloaded but Node B is not, because Node A receives a stream from a different repair but Node B does not. This is a vicious circle driven by several factors: - Streams put pressure on nodes - especially larg(er) partitions - hints tend to queue up - hint delivery puts more pressure - retransmission of failed hint deliveries puts even more pressure - latencies go up - stream validations drift - more (unnecessary) streams - goto 0 This calculation example is just hypothetical. It *may* happen as calculated, but it totally depends on the model, cluster dimensions, cluster load, write activity, distribution of writes and repair execution. I don't claim that fixing this issue will remove all MV performance problems, but it may help to remove one impediment in the mentioned vicious circle. My proposal is NOT to control flushes. This is far too complicated and won't help. A flush, whenever it may happen and whatever range it flushes, may or may not contain a mutation that _should_ be there. What helps is to cut off all data retrospectively at a synchronized and fixed timestamp when executing the validation. You can define a grace period (GP). When you start validation at VS on the repair coordinator, then you expect all mutations that were created before VS - GP to have arrived no later than VS. 
That can be done at SSTable scanner level by filtering all events (cells, tombstones) after VS - GP during validation compaction. Something like the opposite of purging tombstones after GCGS. was (Author: brstgt): I created a little script to calculate some possible scenarios: https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99 Output: {quote} * Calculates the likeliness of a race condition leading to unnecessary repairs * @see https://issues.apache.org/jira/browse/CASSANDRA-12991 * * This assumes that all writes are equally spread over all token ranges * and there is one subrange repair executed for each existing token range 3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 req/s Total Ranges: 768 Likeliness for RC per range: 0.39% Unnecessary range repairs per repair: 3.00 3 Nodes, 256 Tokens / Node, 10ms Mutation
[jira] [Commented] (CASSANDRA-12991) Inter-node race condition in validation compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750908#comment-15750908 ] Benjamin Roth commented on CASSANDRA-12991: --- I created a little script to calculate some possible scenarios: https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99 Output: {quote} * Calculates the likeliness of a race condition leading to unnecessary repairs * @see https://issues.apache.org/jira/browse/CASSANDRA-12991 * * This assumes that all writes are equally spread over all token ranges * and there is one subrange repair executed for each existing token range 3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 req/s Total Ranges: 768 Likeliness for RC per range: 0.39% Unnecessary range repairs per repair: 3.00 3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 req/s Total Ranges: 768 Likeliness for RC per range: 1.56% Unnecessary range repairs per repair: 12.00 8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 req/s Total Ranges: 2048 Likeliness for RC per range: 2.93% Unnecessary range repairs per repair: 60.00 8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 req/s Total Ranges: 2048 Likeliness for RC per range: 5.37% Unnecessary range repairs per repair: 110.00 {quote} You may ask why I entered latencies like 10ms or 20ms - this seems quite high. It is indeed quite high for regular tables and a cluster that is not overloaded. Under those conditions the latency is dominated by your network latency, so 1ms seems quite fair to me. As soon as you use MVs and your cluster tends to overload, higher latencies are not unrealistic. You have to take into account that an MV operation does a read before write, so the latency may vary very much. For MVs the latency is no longer dominated (only) by network latency but by MV lock acquisition and read before write. 
Both factors can introduce MUCH higher latencies, depending on concurrent operations on the MV, number of SSTables, compaction strategy - just everything that affects read performance. If your cluster is overloaded, these effects have an even higher impact. I observed MANY situations on our production system where writes timed out during streaming because of lock contention and/or RBW impacts. These situations mainly pop up during repair sessions when streams cause bulk mutation applies (see the StreamReceiveTask path for MVs). The impact is even higher due to CASSANDRA-12888. Parallel repairs, like e.g. reaper does, make the situation even more unpredictable and increase "drifts" between nodes, like Node A is overloaded but Node B is not, because Node A receives a stream from a different repair but Node B does not. This is a vicious circle driven by several factors: - Streams put pressure on nodes - especially larg(er) partitions - hints tend to queue up - hint delivery puts more pressure - retransmission of failed hint deliveries puts even more pressure - latencies go up - stream validations drift - more (unnecessary) streams - goto 0 This calculation example is just hypothetical. It *may* happen as calculated, but it totally depends on the model, cluster dimensions, cluster load, write activity, distribution of writes and repair execution. I don't claim that fixing this issue will remove all MV performance problems, but it may help to remove one impediment in the mentioned vicious circle. My proposal is NOT to control flushes. This is far too complicated and won't help. A flush, whenever it may happen and whatever range it flushes, may or may not contain a mutation that _should_ be there. The only thing that helps is to cut off all data retrospectively at a synchronized and fixed timestamp when executing the validation. You can only define a grace period (GP). 
When you start validation at VS on the repair coordinator, then you expect all mutations that were created before VS - GP to have arrived no later than VS. That can IMHO only be done at SSTable scanner level, by filtering all events (cells, tombstones) after VS - GP during validation compaction. Something like the opposite of purging tombstones after GCGS. > Inter-node race condition in validation compaction > -- > > Key: CASSANDRA-12991 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12991 > Project: Cassandra > Issue Type: Improvement >Reporter: Benjamin Roth >Priority: Minor > > Problem: > When a validation compaction is triggered by a repair it may happen that due > to in-flight mutations the merkle trees differ but the data is nevertheless > consistent. > Example: > t = 1: > Repair starts, triggers validations > Node A starts validation > t = 10001: > Mutation arrives at Node A > t = 10002: > Mutation arrives at
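The gist with the calculation is only linked above, not included. A rough Python reconstruction that reproduces the quoted numbers, with a loudly labeled assumption: the race window used here (mutation latency plus twice the validation latency) is reverse-engineered from the output and may differ from the formula in the actual gist; all names are invented for illustration:

```python
def race_condition_stats(nodes, tokens_per_node, mutation_ms, validation_ms, req_per_sec):
    """Estimate the likeliness of the validation race condition per range.

    Assumes (as the gist's header states) that writes are spread equally
    over all token ranges and one subrange repair runs per token range.
    The race window below is an assumption chosen to match the quoted output.
    """
    total_ranges = nodes * tokens_per_node
    # Assumed window in which a mutation can land on one replica but not the
    # other while their validations run: mutation latency + 2x validation latency.
    window_s = (mutation_ms + 2 * validation_ms) / 1000.0
    req_per_range_per_sec = req_per_sec / total_ranges
    likeliness = req_per_range_per_sec * window_s   # P(race) for one range
    unnecessary = likeliness * total_ranges          # expected bogus range repairs
    return total_ranges, likeliness, unnecessary
```

With 3 nodes, 256 tokens/node, 1ms/1ms latencies and 1000 req/s this yields 768 ranges, ~0.39% per range, and ~3 unnecessary range repairs per repair, matching the first quoted scenario; the 20ms scenario likewise yields ~5.37% and ~110.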
[jira] [Commented] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747566#comment-15747566 ] Benjamin Roth commented on CASSANDRA-12905: --- I like your naming changes, they make sense. But by making hint delivery async, you made it droppable again. I guess this is not intentional? You also do not send a reply on an exception. I am not familiar with request/response handling, but I guess an exception (like WTE) will just drop the hint and let the hint-sender wait for a reply infinitely, or until it times out? The hint sender should be able to recover from an exception, e.g. by re-transmitting the hints, right? I am not sure if this is the case here. > Retry acquire MV lock on failure instead of throwing WTE on streaming > - > > Key: CASSANDRA-12905 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12905 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: centos 6.7 x86_64 >Reporter: Nir Zilka >Assignee: Benjamin Roth >Priority: Critical > Fix For: 3.10 > > > Hello, > I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, > private VLAN), > first it was 2.2.5.1 and repair worked flawlessly, > second upgrade was to 3.0.9 (with upgradesstables) and also repair worked > well, > then I upgraded 2 weeks ago to 3.9 - and the repair problems started. > there are several error types from the system.log (different nodes) : > - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx > - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation > timed out - received only 0 responses > - Remote peer xxx.xxx.xxx.xxx failed stream session > - Session completed with the following error > org.apache.cassandra.streaming.StreamException: Stream failed > > I use the 3.9 default configuration with the cluster settings adjustments (3 > seeds, GossipingPropertyFileSnitch). > streaming_socket_timeout_in_ms is the default (8640). 
> i'm afraid from consistency problems while i'm not performing repair. > Any ideas? > Thanks, > Nir. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
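The recovery behaviour the comment above asks for - a hint sender that retransmits on a failure reply or timeout instead of letting the hint be silently dropped - can be sketched roughly as follows. This is an illustrative Python sketch, not Cassandra's actual hint-dispatch API; `deliver_hints`, `send`, and `HintDeliveryError` are hypothetical names standing in for the real transport and exception types.

```python
import time

class HintDeliveryError(Exception):
    """Stand-in for a failure reply from the receiving node (e.g. a WTE)."""

def deliver_hints(send, hints, max_attempts=3, base_delay=0.05):
    """Retransmit a hint batch until the receiver acknowledges it.

    `send` is a hypothetical transport callable: it returns an ack on
    success and raises HintDeliveryError or TimeoutError otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(hints)
        except (HintDeliveryError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up; hints remain on disk for a later delivery run
            time.sleep(base_delay * attempt)  # back off before retransmitting
```

The point of the sketch is only that an exception or missing reply feeds back into a bounded retry loop, rather than leaving the sender blocked until its own timeout fires.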
[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
[ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737403#comment-15737403 ] Benjamin Roth commented on CASSANDRA-12888: ---

Hi Paulo,

Thanks for the congrats! About your proposal to skip the base table mutations: I haven't analyzed it thoroughly (no time, you know), but my intuition says that there will be race conditions and possible inconsistencies if you "pick" the base table mutation out of the lock phase. I guess that to guarantee base table <> view replica consistency, you'd have to lock the whole CF while streaming a single SSTable, to ensure that the MV mutations are processed serially and no other base table mutations slip in from the mutation stage and mess with the consistency. As far as I can see, base table apply, base read, and MV mutations MUST be serialized (that's actually why there is a lock). Otherwise you will have stale MV rows again. This is why I think this proposal won't work. Or did I miss the point?

CDC: This case should be quite simple. I think you don't need the write path at all and just have to write the incoming mutations to the commit log in addition to streaming the SSTable. In the worst case (server crash), commit log replay leads to redundant and unrepaired entries, but this should be a rare and recoverable situation. What do you think?

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Stefan Podkowinski
> Assignee: Benjamin Roth
> Priority: Critical
> Fix For: 3.0.x, 3.x
>
> SSTables streamed during the repair process will first be written locally and afterwards either simply added to the pool of existing sstables or, in case of existing MVs or active CDC, replayed on a mutation basis, as described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we must put all partitions through the same write path as normal mutations. This also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental repairs, as we lose the {{repaired_at}} state in the process. Eventually the streamed rows will end up in the unrepaired set, in contrast to the rows on the sender side moved to the repaired set. The next repair run will stream the same data back again, causing rows to bounce on and on between nodes on each repair.
> See the linked dtest for steps to reproduce. An example for reproducing this manually using ccm can be found [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]
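The serialization argument in the comment above - base-table apply, base read, and MV mutation must all happen under one lock, and a streamed mutation should retry the acquisition rather than fail the whole stream session with a WTE (the subject of CASSANDRA-12905) - can be sketched like this. A minimal illustration in Python with hypothetical names; Cassandra's real MV lock is a striped per-partition lock, not a single `threading.Lock`.

```python
import threading

class WriteTimeoutError(Exception):
    """Stand-in for Cassandra's WTE."""

def apply_under_mv_lock(lock, mutation, attempt_timeout=0.05, max_attempts=20):
    """Apply one streamed mutation while holding the MV partition lock.

    Rather than failing on the first missed acquisition, keep retrying so
    that base-table apply, base read, and view mutation stay serialized
    under the lock; only give up after a bounded number of attempts.
    """
    for _ in range(max_attempts):
        if lock.acquire(timeout=attempt_timeout):
            try:
                return mutation()  # base apply + view update, serialized
            finally:
                lock.release()
    raise WriteTimeoutError("could not acquire MV lock after %d attempts" % max_attempts)
```

Under contention the caller simply waits longer instead of surfacing a spurious failure; a WTE is only raised once the retry budget is exhausted.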
[jira] [Comment Edited] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15735402#comment-15735402 ] Benjamin Roth edited comment on CASSANDRA-12905 at 12/9/16 2:02 PM:

Hi Paulo,

Indeed, my time is currently VERY limited; my wife just came home from the hospital with our newborn child. So currently I am not able to think about all of that with the concentration it requires and deserves. So I suggest you apply the cosmetic changes on your own - after all, you have much more experience with the code base, so I'd leave these decisions up to you anyway. I personally would not know how to interpret the name "droppable", but if you say there is a pattern that is also used elsewhere, why not.

Given your feedback, I absolutely support your decision to rework and/or rethink this issue and address it later, so as not to block the release. My plan was to get back to all these issues next week, when (hopefully) my CPU is a little more idle again :) Probably I will then ask for help / a second brain. Thanks so far!