[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 16kb

2018-10-24 Thread Benjamin Roth (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662070#comment-16662070
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

Agreed

> Lower default chunk_length_in_kb from 64kb to 16kb
> --
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>Assignee: Ariel Weisberg
>Priority: Major
> Attachments: CompactIntegerSequence.java, 
> CompactIntegerSequenceBench.java, CompactSummingIntegerSequence.java
>
>
> Too low a chunk size may result in some wasted disk space. Too high a chunk
> size may lead to massive overreads and can have a critical impact on overall
> system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with
> read ahead), the avg read IO went below 20 MB/s, more like 10-15 MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) /
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but
> if the model consists mostly of small rows or small result sets, the read
> overhead with a 64kb chunk size is insanely high. This applies, for example,
> to (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insight into what a difference it can make (460GB data,
> 128GB RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows that the request distribution remained the same, so no "dynamic
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 16kb

2018-10-24 Thread Benjamin Roth (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661948#comment-16661948
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

4-8kb will not "destroy" the OS page cache. Linux pages are 4kb by default, so
4kb chunks fit perfectly into cache pages. It is actually read-ahead that will
kill your performance if you have a lot of disk reads going on. It can also
thrash your page cache if your dataset is much larger than available memory and
you are doing many random reads with small result sets.

We use 4kb chunks and we observed a TREMENDOUS difference in read IO when
disabling read-ahead completely. With the default kernel read-ahead settings,
the physical read IO was roughly 20-30x higher in our use case, specifically
~20MB/s vs. 600MB/s.

To sum up: the 4KB chunk size alone is not the problem; all components have to
be tuned and aligned to remove bottlenecks and make the whole system performant.
The specific parameters always depend on the particular case.
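
To illustrate the kind of tuning meant here, a minimal sketch (the keyspace/table
name and the block device are hypothetical placeholders; the right values always
depend on the workload):

{code}
cqlsh> ALTER TABLE ks.events
   ... WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '4'};
cqlsh> exit

$ nodetool upgradesstables -a ks events   # rewrite existing SSTables with the new chunk size
$ blockdev --setra 8 /dev/sdb             # shrink read-ahead to 8 x 512-byte sectors = 4kb
{code}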

> Lower default chunk_length_in_kb from 64kb to 16kb
> --
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>Assignee: Ariel Weisberg
>Priority: Major
> Attachments: CompactIntegerSequence.java, 
> CompactIntegerSequenceBench.java, CompactSummingIntegerSequence.java
>
>
> Too low a chunk size may result in some wasted disk space. Too high a chunk
> size may lead to massive overreads and can have a critical impact on overall
> system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with
> read ahead), the avg read IO went below 20 MB/s, more like 10-15 MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) /
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but
> if the model consists mostly of small rows or small result sets, the read
> overhead with a 64kb chunk size is insanely high. This applies, for example,
> to (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insight into what a difference it can make (460GB data,
> 128GB RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows that the request distribution remained the same, so no "dynamic
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (CASSANDRA-13798) Disallow filtering on non-primary-key base column for MV

2017-08-25 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141663#comment-16141663
 ] 

Benjamin Roth commented on CASSANDRA-13798:
---

Even if the current implementation has known issues, you cannot kill that (or 
any other) feature just like that. As [~JoshuaMcKenzie] mentioned, how do you 
treat existing installations + schemas?
If I were affected (I really have to check this), this would either force me to
change my schema or block me from upgrading. Neither is viable if the current
solution works for my needs. For example, I am not really affected if I have an
insert-only payload or if my data does not expire.

What you can do, of course:
spit out a warning in the logs on bootstrap if the schema is affected, and on
schema changes that are affected, and refer to a JIRA. Then one can decide to
stay with it or to migrate the schema/model so it is no longer affected.

My 2 cents.

> Disallow filtering on non-primary-key base column for MV
> 
>
> Key: CASSANDRA-13798
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13798
> Project: Cassandra
>  Issue Type: Bug
>  Components: Materialized Views
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>
> We should probably consider disallowing the filtering conditions on
> non-primary-key base columns for Materialized Views that were introduced in
> CASSANDRA-10368.
> The main problem is that the liveness of a view row now depends on
> multiple base columns (multiple filtered non-pk base columns + the base column
> used in the view pk) and this semantic could not be properly supported without
> drastic storage format changes. (SEE CASSANDRA-11500, 
> [background|https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16119823=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16119823])
> We should step back and re-consider the non-primary-key filtering feature 
> together with supporting multiple non-PK cols in MV clustering key in 
> CASSANDRA-10226.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Commented] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams

2017-08-21 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134749#comment-16134749
 ] 

Benjamin Roth commented on CASSANDRA-13299:
---

Sorry for the late response, I was on vacation. No, I am not working on that 
ticket. But thanks a lot for your efforts (not only) on that ticket!

> Potential OOMs and lock contention in write path streams
> 
>
> Key: CASSANDRA-13299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: ZhaoYang
>
> I see a potential OOM when a stream (e.g. repair) goes through the write
> path, as it does with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce row iterators,
> which in turn produce mutations. So every partition creates a single
> mutation, which in the case of (very) big partitions can result in (very) big
> mutations. Those are created on heap and stay there until they have finished
> processing.
> I don't think it is necessary to create a single mutation for each partition.
> Why don't we implement a PartitionUpdateGeneratorIterator that takes an
> UnfilteredRowIterator and a max size and spits out PartitionUpdates to be
> used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size,
> max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size
> could be something like 16M.
> A mutation shouldn't be too large, as it also affects MV partition locking.
> The longer an MV partition is locked during a stream, the higher the chances
> are that WTEs occur during streams.
> I could also imagine that a max number of updates per mutation, regardless of
> size in bytes, could make sense to avoid lock contention.
> I'd love to get feedback and suggestions, including naming suggestions.
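
For illustration only, assuming the stock defaults (commitlog_segment_size_in_mb:
32, with max_mutation_size_in_kb defaulting to half a segment, i.e. 16MB) and the
16M reasonable_absolute_max_size suggested above: the cap would come out as
min(16MB, 16MB, 32MB / 2) = 16MB, so streaming a 1GB partition would yield roughly
64 mutations instead of one single 1GB mutation.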



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Commented] (CASSANDRA-13066) Fast streaming with materialized views

2017-07-10 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079906#comment-16079906
 ] 

Benjamin Roth commented on CASSANDRA-13066:
---

No. Go ahead!

> Fast streaming with materialized views
> --
>
> Key: CASSANDRA-13066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Materialized Views, Streaming and Messaging
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
> Fix For: 4.0
>
>
> I propose adding a configuration option to send streams of tables with MVs
> around the regular write path.
> This may be either a global option or, better, a CF option.
> Background:
> A repair of a CF with an MV that is much out of sync creates many streams.
> These streams all go through the regular write path to assert local
> consistency of the MV. This causes a read-before-write for every single
> mutation, which puts a lot of pressure on the node - much more than simply
> streaming the SSTable down.
> In some cases this can be avoided. Instead of only repairing the base table,
> all base + MV tables would have to be repaired. But this can break eventual
> consistency between the base table and the MV. The proposed behaviour is
> always safe when having append-only MVs. It also works when using CL_QUORUM
> writes, but it cannot be absolutely guaranteed that a quorum write is applied
> atomically, so this can also lead to inconsistencies if a quorum write is
> started but one node dies in the middle of a request.
> So, this proposal can help a lot in some situations but can also break
> consistency in others. That's why it should be left up to the operator to
> decide whether that behaviour is appropriate for individual use cases.
> This issue came up here:
> https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Commented] (CASSANDRA-13464) Failed to create Materialized view with a specific token range

2017-06-25 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062535#comment-16062535
 ] 

Benjamin Roth commented on CASSANDRA-13464:
---

I personally don't see a use case. The token range in the MV does not relate to
a predictable or contiguous range of data in the base table, so I don't know
why I would want to do something like that. If you want to partition your MVs
into more tables, then from my point of view you should rather think of a
different partition key.
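
As a minimal sketch of that suggestion, reusing the table and view from the
example quoted below and simply dropping the token restriction so that value1
acts as the view partition key:

{code}
cqlsh> CREATE MATERIALIZED VIEW test.test_view AS
   ... SELECT value1, id FROM test.test
   ... WHERE id IS NOT NULL AND value1 IS NOT NULL
   ... PRIMARY KEY (value1, id)
   ... WITH CLUSTERING ORDER BY (id ASC);
{code}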

> Failed to create Materialized view with a specific token range
> --
>
> Key: CASSANDRA-13464
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13464
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Natsumi Kojima
>Assignee: Krishna Dattu Koneru
>Priority: Minor
>  Labels: materializedviews
>
> Failed to create Materialized view with a specific token range.
> Example :
> {code:java}
> $ ccm create "MaterializedView" -v 3.0.13
> $ ccm populate  -n 3
> $ ccm start
> $ ccm status
> Cluster: 'MaterializedView'
> ---
> node1: UP
> node3: UP
> node2: UP
> $ccm node1 cqlsh
> Connected to MaterializedView at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.0.13 | CQL spec 3.4.0 | Native protocol v4]
> Use HELP for help.
> cqlsh> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy', 
> 'replication_factor':3};
> cqlsh> CREATE TABLE test.test ( id text PRIMARY KEY , value1 text , value2 
> text, value3 text);
> $ccm node1 ring test 
> Datacenter: datacenter1
> ==
> AddressRackStatus State   LoadOwns
> Token
>   
> 3074457345618258602
> 127.0.0.1  rack1   Up Normal  64.86 KB100.00% 
> -9223372036854775808
> 127.0.0.2  rack1   Up Normal  86.49 KB100.00% 
> -3074457345618258603
> 127.0.0.3  rack1   Up Normal  89.04 KB100.00% 
> 3074457345618258602
> $ ccm node1 cqlsh
> cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('aaa', 
> 'aaa', 'aaa' ,'aaa');
> cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('bbb', 
> 'bbb', 'bbb' ,'bbb');
> cqlsh> SELECT token(id),id,value1 FROM test.test;
>  system.token(id) | id  | value1
> --+-+
>  -4737872923231490581 | aaa |aaa
>  -3071845237020185195 | bbb |bbb
> (2 rows)
> cqlsh> CREATE MATERIALIZED VIEW test.test_view AS SELECT value1, id FROM 
> test.test WHERE id IS NOT NULL AND value1 IS NOT NULL AND TOKEN(id) > 
> -9223372036854775808 AND TOKEN(id) < -3074457345618258603 PRIMARY KEY(value1, 
> id) WITH CLUSTERING ORDER BY (id ASC);
> ServerError: java.lang.ClassCastException: 
> org.apache.cassandra.cql3.TokenRelation cannot be cast to 
> org.apache.cassandra.cql3.SingleColumnRelation
> {code}
> Stacktrace :
> {code:java}
> INFO  [MigrationStage:1] 2017-04-19 18:32:48,131 ColumnFamilyStore.java:389 - 
> Initializing test.test
> WARN  [SharedPool-Worker-1] 2017-04-19 18:44:07,263 FBUtilities.java:337 - 
> Trigger directory doesn't exist, please create it and try again.
> ERROR [SharedPool-Worker-1] 2017-04-19 18:46:10,072 QueryMessage.java:128 - 
> Unexpected error during query
> java.lang.ClassCastException: org.apache.cassandra.cql3.TokenRelation cannot 
> be cast to org.apache.cassandra.cql3.SingleColumnRelation
>   at 
> org.apache.cassandra.db.view.View.relationsToWhereClause(View.java:275) 
> ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.statements.CreateViewStatement.announceMigration(CreateViewStatement.java:219)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:93)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:206)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:237) 
> ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:222) 
> ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513)
>  [apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407)
>  [apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> 

[jira] [Comment Edited] (CASSANDRA-13127) Materialized Views: View row expires too soon

2017-05-09 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002850#comment-16002850
 ] 

Benjamin Roth edited comment on CASSANDRA-13127 at 5/9/17 3:18 PM:
---

[~zznate] I have never "stumbled upon" it, but I was also never paying attention
to that. We also only use the default TTLs, so maybe this is a different thing.
Sounds like it is worth investigating. I think there should be a consensus on
what the expected behaviour should be (especially on partial updates), then some
tests should be written, and then the desired behaviour should be implemented if
it is not yet met.

Unfortunately I don't have the time at the moment to dig deep into that issue 
and go through all the details in the code to see what's going on here.
Just from reading the description of the issue it totally looks like a bug - at 
least from a user's point of view.

If nobody else is available for testing and debugging, maybe I can take a 
deeper look in 1-2 weeks.


was (Author: brstgt):
[~zznate] I have never "stumbled upon" it but i was also never taking care of 
that. We also only use the default TTLs, so maybe this is a different thing.
Sounds like it is worth investigating on it. I think there should be a 
consensus what the expected behaviour should be, then some tests should be 
written and then the desired behaviour should be implemented if it is not met, 
yet.

Unfortunately I don't have the time at the moment to dig deep into that issue 
and go through all the details in the code to see what's going on here.
Just from reading the description of the issue it totally looks like a bug - at 
least from a user's point of view.

> Materialized Views: View row expires too soon
> -
>
> Key: CASSANDRA-13127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13127
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local Write-Read Paths, Materialized Views
>Reporter: Duarte Nunes
>
> Consider the following commands, run against trunk:
> {code}
> echo "DROP MATERIALIZED VIEW ks.mv; DROP TABLE ks.base;" | bin/cqlsh
> echo "CREATE TABLE ks.base (p int, c int, v int, PRIMARY KEY (p, c));" | 
> bin/cqlsh
> echo "CREATE MATERIALIZED VIEW ks.mv AS SELECT p, c FROM base WHERE p IS NOT 
> NULL AND c IS NOT NULL PRIMARY KEY (c, p);" | bin/cqlsh
> echo "INSERT INTO ks.base (p, c) VALUES (0, 0) USING TTL 10;" | bin/cqlsh
> # wait for row liveness to get closer to expiration
> sleep 6;
> echo "UPDATE ks.base USING TTL 8 SET v = 0 WHERE p = 0 and c = 0;" | bin/cqlsh
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+
>  0 | 0 |  7
> (1 rows)
>  c | p
> ---+---
>  0 | 0
> (1 rows)
> # wait for row liveness to expire
> sleep 4;
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+
>  0 | 0 |  3
> (1 rows)
>  c | p
> ---+---
> (0 rows)
> {code}
> Notice how the view row is removed even though the base row is still live. I 
> would say this is because in ViewUpdateGenerator#computeLivenessInfoForEntry 
> the TTLs are compared instead of the expiration times, but I'm not sure I'm 
> getting that far ahead in the code when updating a column that's not in the 
> view.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Commented] (CASSANDRA-13127) Materialized Views: View row expires too soon

2017-05-09 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002850#comment-16002850
 ] 

Benjamin Roth commented on CASSANDRA-13127:
---

[~zznate] I have never "stumbled upon" it, but I was also never paying attention
to that. We also only use the default TTLs, so maybe this is a different thing.
Sounds like it is worth investigating. I think there should be a consensus on
what the expected behaviour should be, then some tests should be written, and
then the desired behaviour should be implemented if it is not yet met.

Unfortunately I don't have the time at the moment to dig deep into that issue 
and go through all the details in the code to see what's going on here.
Just from reading the description of the issue it totally looks like a bug - at 
least from a user's point of view.

> Materialized Views: View row expires too soon
> -
>
> Key: CASSANDRA-13127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13127
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local Write-Read Paths, Materialized Views
>Reporter: Duarte Nunes
>
> Consider the following commands, run against trunk:
> {code}
> echo "DROP MATERIALIZED VIEW ks.mv; DROP TABLE ks.base;" | bin/cqlsh
> echo "CREATE TABLE ks.base (p int, c int, v int, PRIMARY KEY (p, c));" | 
> bin/cqlsh
> echo "CREATE MATERIALIZED VIEW ks.mv AS SELECT p, c FROM base WHERE p IS NOT 
> NULL AND c IS NOT NULL PRIMARY KEY (c, p);" | bin/cqlsh
> echo "INSERT INTO ks.base (p, c) VALUES (0, 0) USING TTL 10;" | bin/cqlsh
> # wait for row liveness to get closer to expiration
> sleep 6;
> echo "UPDATE ks.base USING TTL 8 SET v = 0 WHERE p = 0 and c = 0;" | bin/cqlsh
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+
>  0 | 0 |  7
> (1 rows)
>  c | p
> ---+---
>  0 | 0
> (1 rows)
> # wait for row liveness to expire
> sleep 4;
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+
>  0 | 0 |  3
> (1 rows)
>  c | p
> ---+---
> (0 rows)
> {code}
> Notice how the view row is removed even though the base row is still live. I 
> would say this is because in ViewUpdateGenerator#computeLivenessInfoForEntry 
> the TTLs are compared instead of the expiration times, but I'm not sure I'm 
> getting that far ahead in the code when updating a column that's not in the 
> view.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-05-08 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001349#comment-16001349
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

I am absolutely aware of that! That's why I also added some tests. All unit
tests have passed so far. I also ran a bunch of probably related dtests, like
the MV test suite, and that looked good as well. Nevertheless, I don't want to
rush you - take the time you need! I appreciate any feedback!

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on a per-mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the
> streamed rows will end up in the unrepaired set, in contrast to the rows on
> the sender side, which are moved to the repaired set. The next repair run will
> stream the same data back again, causing rows to bounce back and forth between
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]
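
One way to observe the symptom described above is to check the repaired_at field
of a streamed SSTable with the bundled sstablemetadata tool (a sketch; the data
file path is a placeholder):

{code}
$ sstablemetadata /var/lib/cassandra/data/ks/tbl-*/mc-3-big-Data.db | grep -i "repaired at"
Repaired at: 0
{code}

A value of 0 means the SSTable sits in the unrepaired set, which is where rows
replayed through the write path end up.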



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-05-07 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000302#comment-16000302
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

[~pauloricardomg] Have you been able to take a look at the patch yet? If not,
maybe someone else wants to review it? It has been there for 2 months now.

The patch introduces multiple (active) memtables per CF. This could also help 
in other situations like:
https://issues.apache.org/jira/browse/CASSANDRA-13290
https://issues.apache.org/jira/browse/CASSANDRA-12991


> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on a per-mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the
> streamed rows will end up in the unrepaired set, in contrast to the rows on
> the sender side, which are moved to the repaired set. The next repair run will
> stream the same data back again, causing rows to bounce back and forth between
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Commented] (CASSANDRA-13065) Skip building views during base table streams on range movements

2017-04-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960301#comment-15960301
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

[~pauloricardomg] thanks for the review! Would you like to take a look at
CASSANDRA-13066 for review as well? It depends on this patch and I have to touch
it anyway, so if you have comments on the concept and naming (and you probably
will), I can address them in one run.

> Skip building views during base table streams on range movements
> 
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow, as all streams
> go through the regular write paths. This causes read-before-writes for every
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount
> of time.
> Using the regular streaming behaviour for consistent range movements works
> much better in this case and does not break the MV local consistency contract.
> Already tested on our own cluster.
> The bootstrap case is super easy to handle; the decommission case requires
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-04-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959033#comment-15959033
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

Looks good. I like putting the requiresViewBuild property into the 
StreamOperation!

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow, as all streams
> go through the regular write paths. This causes read-before-writes for every
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount
> of time.
> Using the regular streaming behaviour for consistent range movements works
> much better in this case and does not break the MV local consistency contract.
> Already tested on our own cluster.
> The bootstrap case is super easy to handle; the decommission case requires
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-04-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954589#comment-15954589
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

[~pauloricardomg]

There you go. My first complete patch. I hope it's ok and works :)
https://github.com/Jaumo/cassandra/commit/88699700feb6b9a504df88ff063b82641d7939f7

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow, as all streams
> go through the regular write paths. This causes read-before-writes for every
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount
> of time.
> Using the regular streaming behaviour for consistent range movements works
> much better in this case and does not break the MV local consistency contract.
> Already tested on our own cluster.
> The bootstrap case is super easy to handle; the decommission case requires
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-03-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945381#comment-15945381
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

[~pauloricardomg]

Gnaaa. Stupid search+replace errors.
Made changes according to your feedback: 
https://github.com/Jaumo/cassandra/commit/7ba773e901bcdb3bf830417ebd07ad4786a5b179
CDC uses the write path again. I hope everything's ok this time.

Thanks for your patience :)

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow, as all streams
> go through the regular write paths. This causes read-before-writes for every
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount
> of time.
> Using the regular streaming behaviour for consistent range movements works
> much better in this case and does not break the MV local consistency contract.
> Already tested on our own cluster.
> The bootstrap case is super easy to handle; the decommission case requires
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-03-27 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942722#comment-15942722
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

[~pauloricardomg] Did you notice my update?

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow, as all streams
> go through the regular write paths. This causes read-before-writes for every
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount
> of time.
> Using the regular streaming behaviour for consistent range movements works
> much better in this case and does not break the MV local consistency contract.
> Already tested on our own cluster.
> The bootstrap case is super easy to handle; the decommission case requires
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily

2017-03-23 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938577#comment-15938577
 ] 

Benjamin Roth commented on CASSANDRA-13226:
---

That does not make sense to me. Why should more be streamed than requested?
That sounds like a waste of resources to me. Streaming more than a repair
requires assumes that the system is still creating inconsistent data during the
repair.

> StreamPlan for incremental repairs flushing memtables unnecessarily
> ---
>
> Key: CASSANDRA-13226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13226
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Minor
> Fix For: 4.0
>
>
> Since incremental repairs are run against a fixed dataset, there's no need to 
> flush memtables when streaming for them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13315) Consistency is confusing for new users

2017-03-09 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903459#comment-15903459
 ] 

Benjamin Roth commented on CASSANDRA-13315:
---

I had the same problems in the beginning, so generally +1. But IMHO this should
go along with an explanatory section in the official docs.

> Consistency is confusing for new users
> --
>
> Key: CASSANDRA-13315
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13315
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Ryan Svihla
>
> New users really struggle with consistency level and fall into a large number 
> of tarpits trying to decide on the right one.
> 1. There are a LOT of consistency levels, and it's up to the end user to
> reason about which combinations are valid and whether they really do what the
> user intends. Is there any reason why writing at ALL and reading at CL TWO is
> better than reading at CL ONE?
> 2. They require a good understanding of failure modes to do well. It's not 
> uncommon for people to use CL one and wonder why their data is missing.
> 3. The serial consistency level "bucket" is confusing to even write about and 
> easy to get wrong even for experienced users.
> So I propose the following steps (EDIT based on Jonathan's comment):
> 1. Remove the separate "serial consistency" bucket and just have all
> consistency levels in one bucket to set; conditions are still required for
> SERIAL/LOCAL_SERIAL
> 2. Add 3 new consistency levels that point to existing ones but convey
> intent much more clearly:
>* EVENTUALLY_CONSISTENT = LOCAL_ONE reads and writes
>* HIGHLY_CONSISTENT = LOCAL_QUORUM reads and writes
>* TRANSACTIONALLY_CONSISTENT = LOCAL_SERIAL reads and writes
> For global versions of these I propose keeping the old ones around; they're
> rarely used in the field except by accident or by particularly opinionated and
> advanced users.
> Drivers should put the new consistency levels in a new package and docs
> should be updated to suggest their use. Likewise, setting the default CL
> should only offer those three settings and apply it to reads and writes at
> the same time.
> I'm going to suggest that CQLSH should default to HIGHLY_CONSISTENT. New
> sysadmins get surprised by this frequently, and I can think of a couple of
> very major escalations because people were confused about what the default
> behavior was.
> The benefit of all this change is that we greatly shrink the surface area one
> has to understand when learning Cassandra, and we have far fewer bad initial
> experiences and surprises. New users will be able to wrap their brains around
> those 3 ideas more readily than around "what happens when I have RF 2, QUORUM
> writes and ONE reads". Advanced users still get access to all the levels,
> while new users don't have to learn all the ins and outs of distributed
> theory just to write data and be able to read it back.
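
For illustration, the proposed HIGHLY_CONSISTENT level corresponds to what can
be set manually in cqlsh today (a sketch; keyspace and table are placeholders):

{code}
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> SELECT * FROM ks.tbl WHERE id = 'aaa';  -- reads and writes in this session now use LOCAL_QUORUM
{code}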



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster

2017-03-08 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900951#comment-15900951
 ] 

Benjamin Roth commented on CASSANDRA-12489:
---

Awesome. I see that there has been a lot of work done by thelastpickle since I
initially forked it from Spotify (which didn't support 3.x back then).
A simple changelog.md would be even more awesome, to see whether there have been
important changes.

> consecutive repairs of same range always finds 'out of sync' in sane cluster
> 
>
> Key: CASSANDRA-12489
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12489
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>  Labels: lhf
> Attachments: trace_3_10.1.log.gz, trace_3_10.2.log.gz, 
> trace_3_10.3.log.gz, trace_3_10.4.log.gz, trace_3_9.1.log.gz, 
> trace_3_9.2.log.gz
>
>
> No matter how often or when I run the same subrange repair, it ALWAYS tells
> me that some ranges are out of sync. Tested on 3.9 + 3.10 (git trunk of
> 2016-08-17). The cluster is sane: all nodes are up and the cluster is not
> overloaded.
> I guess this is not a desired behaviour. I'd expect that a repair does what 
> it says and a consecutive repair shouldn't report "out of syncs" any more if 
> the cluster is sane.
> Especially for tables with MVs, this puts a lot of pressure on repair, as
> ranges are repaired over and over again.
> See traces of different runs attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898051#comment-15898051
 ] 

Benjamin Roth commented on CASSANDRA-12489:
---

Thanks for the answer. That's what I thought. But what right do incremental
repairs then have to exist in the real world if (most, many, whatever) people
use a tool that makes repairs manageable and thereby eliminates this case? The
use case and the real benefit are quite limited then, aren't they?
Probably that's a philosophical question, but I'm curious what others think
about it and whether I am maybe missing a valuable use case.

> consecutive repairs of same range always finds 'out of sync' in sane cluster
> 
>
> Key: CASSANDRA-12489
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12489
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>  Labels: lhf
> Attachments: trace_3_10.1.log.gz, trace_3_10.2.log.gz, 
> trace_3_10.3.log.gz, trace_3_10.4.log.gz, trace_3_9.1.log.gz, 
> trace_3_9.2.log.gz
>
>
> No matter how often or when I run the same subrange repair, it ALWAYS tells
> me that some ranges are out of sync. Tested on 3.9 + 3.10 (git trunk of
> 2016-08-17). The cluster is sane: all nodes are up and the cluster is not
> overloaded.
> I guess this is not a desired behaviour. I'd expect that a repair does what 
> it says and a consecutive repair shouldn't report "out of syncs" any more if 
> the cluster is sane.
> Especially for tables with MVs, this puts a lot of pressure on repair, as
> ranges are repaired over and over again.
> See traces of different runs attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898028#comment-15898028
 ] 

Benjamin Roth commented on CASSANDRA-12489:
---

May I ask what the reason is that incremental + subrange repair doesn't do
anticompaction? Is it because anticompaction is too expensive in this case? Or,
to put it differently: is a subrange full repair cheaper than a subrange
incremental repair with anticompaction?
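
For reference, the two variants being compared (a sketch; keyspace, table and
token bounds are placeholders):

{code}
$ nodetool repair -st -9223372036854775808 -et -3074457345618258603 ks tbl        # subrange, incremental (3.x default)
$ nodetool repair -full -st -9223372036854775808 -et -3074457345618258603 ks tbl  # subrange, full
{code}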

> consecutive repairs of same range always finds 'out of sync' in sane cluster
> 
>
> Key: CASSANDRA-12489
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12489
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>  Labels: lhf
> Attachments: trace_3_10.1.log.gz, trace_3_10.2.log.gz, 
> trace_3_10.3.log.gz, trace_3_10.4.log.gz, trace_3_9.1.log.gz, 
> trace_3_9.2.log.gz
>
>
> No matter how often or when I run the same subrange repair, it ALWAYS tells
> me that some ranges are out of sync. Tested on 3.9 + 3.10 (git trunk of
> 2016-08-17). The cluster is sane: all nodes are up and the cluster is not
> overloaded.
> I guess this is not a desired behaviour. I'd expect that a repair does what 
> it says and a consecutive repair shouldn't report "out of syncs" any more if 
> the cluster is sane.
> Especially for tables with MVs, this puts a lot of pressure on repair, as
> ranges are repaired over and over again.
> See traces of different runs attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897969#comment-15897969
 ] 

Benjamin Roth commented on CASSANDRA-13303:
---

My trunk is older, so I'll close the ticket.

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but the
> conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth resolved CASSANDRA-13303.
---
Resolution: Duplicate

Duplicate of the CASSANDRA-13038 regression fixes.

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but the
> conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897966#comment-15897966
 ] 

Benjamin Roth commented on CASSANDRA-13303:
---

I read the comments in 13038; I guess we are talking about the same thing.
MetadataSerializerTest also fails on my machine.

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but the
> conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897956#comment-15897956
 ] 

Benjamin Roth commented on CASSANDRA-13303:
---

CASSANDRA-13038 is fixed in commit a5ce963117acf5e4cf0a31057551f2f42385c398,
which I have in my trunk, and I don't see a newer commit for 13038.

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but the
> conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897942#comment-15897942
 ] 

Benjamin Roth edited comment on CASSANDRA-13303 at 3/6/17 7:58 PM:
---

1. Happens in trunk
2. Maybe not clear enough: the table is simply not compacted because
AbstractCompactionStrategy.worthDroppingTombstones returns false, as
droppableRatio is 0.0:

{code}
double droppableRatio = sstable.getEstimatedDroppableTombstoneRatio(gcBefore);
{code}

Message:
"should be less than x but was y"


was (Author: brstgt):
1. Happens in trunk
2. Maybe not clear enough: Table is simply not compacted as 
AbstractCompactionStrategy.worthDroppingTombstones returns false as 
droppableRatio is 0.0 

{code}
double droppableRatio = sstable.getEstimatedDroppableTombstoneRatio(gcBefore);
{code}

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but the
> conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897942#comment-15897942
 ] 

Benjamin Roth commented on CASSANDRA-13303:
---

1. Happens in trunk
2. Maybe not clear enough: the table is simply not compacted because
AbstractCompactionStrategy.worthDroppingTombstones returns false, as
droppableRatio is 0.0:

{code}
double droppableRatio = sstable.getEstimatedDroppableTombstoneRatio(gcBefore);
{code}

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but the
> conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897917#comment-15897917
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

Review would be much appreciated. Don't know if @pauloricardomg still wants to 
do the review. Please give me some feedback, thanks!

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on a per-mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the
> streamed rows will end up in the unrepaired set, in contrast to the rows on
> the sender side, which are moved to the repaired set. The next repair run will
> stream the same data back again, causing rows to bounce back and forth between
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897917#comment-15897917
 ] 

Benjamin Roth edited comment on CASSANDRA-12888 at 3/6/17 7:48 PM:
---

Review would be much appreciated. Don't know if [~pauloricardomg] still wants 
to do the review. Please give me some feedback, thanks!


was (Author: brstgt):
Review would be much appreciated. Don't know if @pauloricardomg still wants to 
do the review. Please give me some feedback, thanks!

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on a per-mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the
> streamed rows will end up in the unrepaired set, in contrast to the rows on
> the sender side, which are moved to the repaired set. The next repair run will
> stream the same data back again, causing rows to bounce back and forth between
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-06 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-12888:
--
Status: Patch Available  (was: Awaiting Feedback)

https://github.com/apache/cassandra/compare/trunk...Jaumo:CASSANDRA-12888

Some dtest assertions:
https://github.com/riptano/cassandra-dtest/compare/master...Jaumo:CASSANDRA-12888?expand=1

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13303:
-

 Summary: 
CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
 Key: CASSANDRA-13303
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
 Project: Cassandra
  Issue Type: Bug
Reporter: Benjamin Roth


On my machine, this test succeeds maybe 1 out of 10 times.

The cause seems to be that the sstable is not elected for compaction in 
worthDroppingTombstones, as droppableRatio is 0.0.

I don't know the primary intention of this test, so I didn't touch it, but the 
conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897589#comment-15897589
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

Btw.:
My concept seems to work, but there is one question left:
Why does a StreamSession create unrepaired SSTables?
IncomingFileMessage
 => Creates RangeAwareSSTableWriter:97
 => cfs.createSSTableMultiWriter
...
 => CompactionStrategyManager.createSSTableMultiWriter:185

Will it be marked as repaired later? If so, where/when?

Why I ask:
The received SSTable has the repairedFlag in RangeAwareSSTableWriter and its 
header, but it is lost when the SSTable is finished and returned as an 
SSTableReader.
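
To make the question concrete, here is a toy sketch of the suspected gap 
(made-up class names, not the actual RangeAwareSSTableWriter/SSTableReader 
code): the writer knows the repairedAt value while the stream is received, but 
unless it is copied into the produced reader's metadata on finish, downstream 
code sees the sstable as unrepaired.

{code}
// Toy illustration only - hypothetical classes, not Cassandra internals.
final class ToyWriter
{
    private final long repairedAt;                 // known while the stream is received

    ToyWriter(long repairedAt) { this.repairedAt = repairedAt; }

    ToyReader finish(boolean propagateRepairedAt)
    {
        // If the flag is dropped at this point, the finished sstable comes
        // back with repairedAt == 0, i.e. unrepaired.
        return new ToyReader(propagateRepairedAt ? repairedAt : 0L);
    }
}

final class ToyReader
{
    final long repairedAt;
    ToyReader(long repairedAt) { this.repairedAt = repairedAt; }
}
{code}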

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-06 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897565#comment-15897565
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

I also had this idea but it won't work. It will totally break base <> MV 
consistency. Except: you lock all involved partitions for the whole process. 
But that would create insanely long locks and extremely high contention.

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams

2017-03-05 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13299:
--
Description: 
I see a potential OOM when a stream (e.g. repair) goes through the write path 
as it is with MVs.

StreamReceiveTask gets a bunch of SSTableReaders. These produce row iterators, 
which in turn produce mutations. So every partition creates a single mutation, 
which in case of (very) big partitions can result in (very) big mutations. 
Those are created on heap and stay there until they have finished processing.

I don't think it is necessary to create a single mutation for each partition. 
Why don't we implement a PartitionUpdateGeneratorIterator that takes an 
UnfilteredRowIterator and a max size and spits out PartitionUpdates to be used 
to create and apply mutations?
The max size should be something like min(reasonable_absolute_max_size, 
max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size 
could be something like 16M.
A mutation shouldn't be too large as it also affects MV partition locking. The 
longer an MV partition is locked during a stream, the higher the chances are 
that WTEs occur during streams.
I could also imagine that a max number of updates per mutation, regardless of 
size in bytes, could make sense to avoid lock contention.

Love to get feedback and suggestions, incl. naming suggestions.
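
As a rough, self-contained illustration of the idea (made-up names and generic 
types instead of UnfilteredRowIterator/PartitionUpdate; a sketch, not a 
proposed patch), the core is just size-bounded batching over a row iterator:

{code}
// Illustrative sketch only: emit batches whose accumulated size stays below a
// cap instead of materializing one huge update per partition. The cap mirrors
// min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2).
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToLongFunction;

final class SizeBoundedBatcher<R>
{
    private final Iterator<R> rows;
    private final ToLongFunction<R> sizeOf;     // stand-in for a row's serialized size
    private final long maxBatchBytes;

    SizeBoundedBatcher(Iterator<R> rows, ToLongFunction<R> sizeOf, long maxBatchBytes)
    {
        this.rows = rows;
        this.sizeOf = sizeOf;
        this.maxBatchBytes = maxBatchBytes;
    }

    /** Next batch of rows, empty once the iterator is exhausted. A single
     *  oversized row still forms a batch of its own. */
    List<R> nextBatch()
    {
        List<R> batch = new ArrayList<>();
        long bytes = 0;
        while (rows.hasNext() && bytes < maxBatchBytes)
        {
            R row = rows.next();
            batch.add(row);
            bytes += sizeOf.applyAsLong(row);
        }
        return batch;
    }

    static long defaultCap(long maxMutationSize, long commitLogSegmentSize)
    {
        long reasonableAbsoluteMax = 16L << 20;   // ~16M, as suggested above
        return Math.min(reasonableAbsoluteMax,
                        Math.min(maxMutationSize, commitLogSegmentSize / 2));
    }
}
{code}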


  was:
I see a potential OOM, when a stream (e.g. repair) goes through the write path 
as it is with MVs.

StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators 
and they again produce mutations. So every partition creates a single mutation, 
which in case of (very) big partitions can result in (very) big mutations. 
Those are created on heap and stay there until they are processed.

I don't think it is necessary to create a single mutation for each partition. 
Why don't we implement a PartitionUpdateGeneratorIterator that takes a 
UnfilteredRowIterator and a max size and spits out PartitionUpdates to be used 
to create and apply mutations?
The max size should be something like min(reasonable_absolute_max_size, 
max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size 
could be like 16M or sth.
A mutation shouldn't be too large as it also affects MV partition locking. As 
longer a MV partition is locked during a stream, the higher chances are that 
WTE's occur during streams.
I could also imagine that a max number of updates per mutation regardless of 
size in bytes could make sense to avoid lock contention.

Love to get feedback and suggestions, incl. naming suggestions.



> Potential OOMs and lock contention in write path streams
> 
>
> Key: CASSANDRA-13299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>
> I see a potential OOM, when a stream (e.g. repair) goes through the write 
> path as it is with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators 
> and they again produce mutations. So every partition creates a single 
> mutation, which in case of (very) big partitions can result in (very) big 
> mutations. Those are created on heap and stay there until they finished 
> processing.
> I don't think it is necessary to create a single mutation for each partition. 
> Why don't we implement a PartitionUpdateGeneratorIterator that takes a 
> UnfilteredRowIterator and a max size and spits out PartitionUpdates to be 
> used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size, 
> max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size 
> could be like 16M or sth.
> A mutation shouldn't be too large as it also affects MV partition locking. 
> The longer a MV partition is locked during a stream, the higher chances are 
> that WTE's occur during streams.
> I could also imagine that a max number of updates per mutation regardless of 
> size in bytes could make sense to avoid lock contention.
> Love to get feedback and suggestions, incl. naming suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams

2017-03-05 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896460#comment-15896460
 ] 

Benjamin Roth commented on CASSANDRA-13299:
---

Relating to CASSANDRA-11670, this would also allow writing all streamed 
mutations to the commitlog without problems.
I also propose to do so with small streams (see CASSANDRA-13290). Writing small 
streams (e.g. < 100KB) to the commitlog does not require a flush at the end of 
stream receive. This avoids tons of flushes if tons of tiny streams are sent 
during a repair session.

These are maybe apples and oranges, but fixing all these loose ends makes the 
whole process less error-prone and probably makes it perform better.
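
A minimal sketch of that routing decision, assuming CASSANDRA-12888 lands first 
(the threshold and names are illustrative, not existing Cassandra options):

{code}
// Illustrative only: decide whether an incoming stream is applied via the
// normal write path (commitlog + memtable, no end-of-stream flush) or added
// as sstables, based on its total size.
final class StreamApplyPolicy
{
    static final long TINY_STREAM_BYTES = 100 * 1024;   // "< 100KB" as proposed above

    enum Path { WRITE_PATH_NO_FLUSH, SSTABLE_ADD }

    static Path choose(long totalStreamSizeBytes)
    {
        // Tiny repair streams get buffered by memtables and the commitlog, so
        // they no longer each produce a tiny sstable for compaction to clean up.
        return totalStreamSizeBytes < TINY_STREAM_BYTES
             ? Path.WRITE_PATH_NO_FLUSH
             : Path.SSTABLE_ADD;
    }
}
{code}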

> Potential OOMs and lock contention in write path streams
> 
>
> Key: CASSANDRA-13299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>
> I see a potential OOM, when a stream (e.g. repair) goes through the write 
> path as it is with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators 
> and they again produce mutations. So every partition creates a single 
> mutation, which in case of (very) big partitions can result in (very) big 
> mutations. Those are created on heap and stay there until they are processed.
> I don't think it is necessary to create a single mutation for each partition. 
> Why don't we implement a PartitionUpdateGeneratorIterator that takes a 
> UnfilteredRowIterator and a max size and spits out PartitionUpdates to be 
> used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size, 
> max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size 
> could be like 16M or sth.
> A mutation shouldn't be too large as it also affects MV partition locking. The 
> longer an MV partition is locked during a stream, the higher the chances are 
> that WTEs occur during streams.
> I could also imagine that a max number of updates per mutation regardless of 
> size in bytes could make sense to avoid lock contention.
> Love to get feedback and suggestions, incl. naming suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CASSANDRA-13299) Potential OOMs and lock contention in write path streams

2017-03-05 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13299:
-

 Summary: Potential OOMs and lock contention in write path streams
 Key: CASSANDRA-13299
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benjamin Roth


I see a potential OOM, when a stream (e.g. repair) goes through the write path 
as it is with MVs.

StreamReceiveTask gets a bunch of SSTableReaders. These produce rowiterators 
and they again produce mutations. So every partition creates a single mutation, 
which in case of (very) big partitions can result in (very) big mutations. 
Those are created on heap and stay there until they are processed.

I don't think it is necessary to create a single mutation for each partition. 
Why don't we implement a PartitionUpdateGeneratorIterator that takes a 
UnfilteredRowIterator and a max size and spits out PartitionUpdates to be used 
to create and apply mutations?
The max size should be something like min(reasonable_absolute_max_size, 
max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size 
could be like 16M or sth.
A mutation shouldn't be too large as it also affects MV partition locking. The 
longer an MV partition is locked during a stream, the higher the chances are 
that WTEs occur during streams.
I could also imagine that a max number of updates per mutation regardless of 
size in bytes could make sense to avoid lock contention.

Love to get feedback and suggestions, incl. naming suggestions.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-05 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896363#comment-15896363
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

Just perfect! That's EXACTLY what I wanted to know and it helps me to continue 
to work on that ticket. I started some proof-of-concept work but it still needs 
some finalizing and exhaustive testing.

The concept is quite simple in theory (hopefully in reality, too):
Each table may now have more than one active memtable, one each for unrepaired 
and repaired data (like the compaction pools). The repaired memtable does not 
have to be resident all the time, only during repairs, so my intention was to 
create it on demand and not to automatically re-create one after a flush. To 
keep things simple for a start, my intention was to apply the same flush 
behaviour to both memtables: either both are flushed or none is. Maybe this 
could be optimized in the future.
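
A stripped-down sketch of that routing, with plain maps standing in for 
memtables (all names are made up; the real change would live in the memtable 
and flush machinery):

{code}
// Illustrative only: one "memtable" per repair state; the repaired one is
// created lazily during repairs and both are flushed together, as described above.
import java.util.Map;
import java.util.TreeMap;

final class RepairAwareMemtables
{
    static final long UNREPAIRED = 0L;

    private final Map<String, String> unrepaired = new TreeMap<>();
    private Map<String, String> repaired;            // only exists during repairs

    void apply(String key, String value, long repairedAt)
    {
        if (repairedAt == UNREPAIRED)
        {
            unrepaired.put(key, value);
            return;
        }
        if (repaired == null)
            repaired = new TreeMap<>();              // created on demand
        repaired.put(key, value);
    }

    void flushBoth(long repairedAtForFlush)
    {
        // Sketch: each map would be written to its own sstable, the repaired one
        // stamped with repairedAtForFlush, the other left unrepaired (0).
        unrepaired.clear();
        repaired = null;
    }
}
{code}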


> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895140#comment-15895140
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

Maybe my former question got lost:

What effect does the repairedAt flag have for future repairs, except that a 
non-zero value means that a table has been repaired at some time?
I am happy about any code references.

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894645#comment-15894645
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

For a detailed explanation, an excerpt from that discussion:
-

... there are still possible scenarios where it's possible to break consistency 
by repairing the base and the view separately even with QUORUM writes:

Initial state:

Base replica A: {k0=v0, ts=0}
Base replica B: {k0=v0, ts=0}
Base replica C: {k0=v0, ts=0}
View paired replica A: {v0=k0, ts=0}
View paired replica B: {v0=k0, ts=0}
View paired replica C: {v0=k0, ts=0}

Base replica A receives write {k1=v1, ts=1}, propagates to view paired replica 
A and dies.

Current state is:
Base replica A: {k1=v1, ts=1}
Base replica B: {k0=v0, ts=0}
Base replica C: {k0=v0, ts=0}
View paired replica A: {v1=k1, ts=1}
View paired replica B: {v0=k0, ts=0}
View paired replica C: {v0=k0, ts=0}

Base replica B and C receives write {k2=v2, ts=2}, write to their paired 
replica. Write is successful at QUORUM.

Current state is:
Base replica A: {k1=v1, ts=1}
Base replica B: {k2=v2, ts=2}
Base replica C: {k2=v2, ts=2}
View paired replica A: {v1=k1, ts=1}
View paired replica B: {v2=k2, ts=2}
View paired replica C: {v2=k2, ts=2}

A returns from the dead. Repair base table:
Base replica A: {k2=v2, ts=2}
Base replica B: {k2=v2, ts=2}
Base replica C: {k2=v2, ts=2}

Repair MV:
View paired replica A: {v1=k1, ts=1} and {v2=k2, ts=2}
View paired replica B: {v1=k1, ts=1} and {v2=k2, ts=2}
View paired replica C: {v1=k1, ts=1} and {v2=k2, ts=2}

So, this requires replica A to generate a tombstone for {v1=k1, ts=1} during 
repair of base table.

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894639#comment-15894639
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

A repair must go through the write path except for some special cases. I also 
first had the idea to avoid it completely, but in discussion with 
[~pauloricardomg] it turned out that this may introduce inconsistencies that 
could only be fixed by a view rebuild, because it leaves stale rows.
I know that all this stuff is totally counter-intuitive, but just streaming 
"blindly" all sstables (incl. MV tables) down is not correct. This is why I am 
trying to improve the mutation-based approach.

If the SSTables for MVs get corrupted or lost, the only way to fix it is to 
rebuild that view again. There is no way (at least none I see atm) that would 
consistently repair a view from other nodes.

The underlying principle is:
- A view must always be consistent with its base table
- A view does not have to be consistent among nodes; that's handled by 
repairing the base table

That's also why you don't have to run a repair before building a view. 
Nevertheless it would not help anyway, because you NEVER have a 100% guaranteed 
consistent state. A repair only guarantees consistency up to the point of 
repair.

The "know what you are doing" option is offered by CASSANDRA-13066, btw. 
In that ticket I also adapted the selection of CFs (tables + MVs) when doing a 
keyspace repair, depending on whether the MV is repaired by stream or by 
mutation.

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-11670) Rebuilding or streaming MV generates mutations larger than max_mutation_size_in_kb

2017-03-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894128#comment-15894128
 ] 

Benjamin Roth commented on CASSANDRA-11670:
---

Hmmm, I guess this possibly breaks consistency in repair streams + MVs.
In StreamReceiveTask, mutations are applied without the commitlog because the 
CF is flushed at the end. But MVs are not flushed.

Either:
Also flush all MVs at the end of the stream task - although it is not certain 
which MVs this is actually required for, as we do not know where the view 
replica updates eventually go.

Or:
Enable the commitlog for view replica updates even if the base table does not 
write to the commitlog.
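
A tiny sketch of the second option, just to make the decision explicit 
(hypothetical helper, not an existing Cassandra API):

{code}
// Illustrative only: streamed base-table mutations may skip the commitlog
// because the base CF is flushed at the end of the stream task; view replica
// updates are not flushed there, so they either need a flush of their own or
// must go through the commitlog.
final class StreamedMutationDurability
{
    static boolean writeCommitLog(boolean isViewUpdate, boolean baseWillBeFlushed)
    {
        if (isViewUpdate)
            return true;                 // option 2: commitlog for view updates
        return !baseWillBeFlushed;       // base table behaviour stays as it is
    }
}
{code}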

> Rebuilding or streaming MV generates mutations larger than 
> max_mutation_size_in_kb
> --
>
> Key: CASSANDRA-11670
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11670
> Project: Cassandra
>  Issue Type: Bug
>  Components: Configuration, Streaming and Messaging
>Reporter: Anastasia Osintseva
>Assignee: Paulo Motta
> Fix For: 3.0.10, 3.10
>
>
> I have in cluster 2 DC, in each DC - 2 Nodes. I wanted to add 1 node to each 
> DC. One node has been added successfully after I had made scrubing. 
> Now I'm trying to add node to another DC, but get error: 
> org.apache.cassandra.streaming.StreamException: Stream failed. 
> After scrubing and repair I get the same error.  
> {noformat}
> ERROR [StreamReceiveTask:5] 2016-04-27 00:33:21,082 Keyspace.java:492 - 
> Unknown exception caught while attempting to update MaterializedView! 
> messages_dump.messages
> java.lang.IllegalArgumentException: Mutation of 34974901 bytes is too large 
> for the maxiumum size of 33554432
>   at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:264) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:469) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:384) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Mutation.apply(Mutation.java:217) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at 
> org.apache.cassandra.batchlog.BatchlogManager.store(BatchlogManager.java:146) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at 
> org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:724) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at 
> org.apache.cassandra.db.view.ViewManager.pushViewReplicaUpdates(ViewManager.java:149)
>  ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:487) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:384) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Mutation.apply(Mutation.java:217) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Mutation.applyUnsafe(Mutation.java:236) 
> [apache-cassandra-3.0.5.jar:3.0.5]
>   at 
> org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:169)
>  [apache-cassandra-3.0.5.jar:3.0.5]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [na:1.8.0_11]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [na:1.8.0_11]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_11]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_11]
>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_11]
> ERROR [StreamReceiveTask:5] 2016-04-27 00:33:21,082 
> StreamReceiveTask.java:214 - Error applying streamed data: 
> java.lang.IllegalArgumentException: Mutation of 34974901 bytes is too large 
> for the maxiumum size of 33554432
>   at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:264) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:469) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:384) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at org.apache.cassandra.db.Mutation.apply(Mutation.java:217) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at 
> org.apache.cassandra.batchlog.BatchlogManager.store(BatchlogManager.java:146) 
> ~[apache-cassandra-3.0.5.jar:3.0.5]
>   at 
> 

[jira] [Comment Edited] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818955#comment-15818955
 ] 

Benjamin Roth edited comment on CASSANDRA-12888 at 3/3/17 9:57 AM:
---

Hi Victor,

We use MVs in Production with billions of records without known data loss.
Painful + slow refers to repairs and range movements (e.g. bootstrap +
decommission). Also (as mentioned in this ticket) incremental repairs don't
work, so full repair creates some overhead. Until 3.10 there are bugs
leading to write timeouts, even to NPEs and completely blocked mutation
stages. This could even bring your cluster down. In 3.10 some issues have
been resolved - actually we use a patched trunk version which is 1-2 months
old.

Depending on your model, MVs can help a lot from a developer perspective.
Some cases are very resource intensive to manage without MVs, requiring
distributed locks and/or CAS.
For append-only workloads, it may be simpler to NOT use MVs at the moment.
They aren't very complex and MVs won't help that much compared to the
problems that may arise with them.

Painful scenarios: There is no recipe for that. You may or may not
encounter performance issues, depending on your model and your workload.
I'd recommend not to use MVs that use a different partition key on the MV
than on the base table as this requires inter-node communication for EVERY
write operation. So you can easily kill your cluster with bulk operations
(like in streaming).

At the moment our cluster runs stable but it took months to find all the
bottlenecks, race conditions, resume from failures and so on. So my
recommendation: You can get it work but you need time and you should not
start with critical data, at least if it is not backed by another stable
storage. And you should use 3.10 when it is finally released or build your
own version from trunk. I would not recommend to use < 3.10 for MVs.

Btw.: Our own patched version does some dirty tricks, that may lead to
inconsistencies in some situations but we prefer some possible
inconsistencies (we can deal with) over performance bottlenecks. I created
several tickets to improve MV performance in some streaming situations but
it will take some time to really improve that situation.

Does this answer your question?


was (Author: brstgt):
Hi Victor,

We use MVs in Production with billions of records without known data loss.
Painful + slow refers to repairs and range movements (e.g. bootstrap +
decommission). Also (as mentioned in this ticket) incremental repairs dont
work, so full repair creates some overhead. Until 3.10 there are bugs
leading to write timeouts, even to NPEs and completely blocked mutation
stages. This could even bring your cluster down. In 3.10 some issues have
been resolved - actually we use a patched trunk version which is 1-2 months
old.

Depending on your model, MVs can help a lot from a developer perspective.
Some cases are very resource intensive to manage without MVs, requiring
distributed locks and/or CAS.
For append-only workloads, it may be simpler to NOT use MVs at the moment.
They aren't very complex and MVs wont help that much compared to the
problems that may raise with them.

Painful scenarios: There is no recipe for that. You may or may not
encounter performance issues, depending on your model and your workload.
I'd recommend not to use MVs that use a different partition key on the MV
than on the base table as this requires inter-node communication for EVERY
write operation. So you can easily kill your cluster with bulk operations
(like in streaming).

At the moment our cluster runs stable but it took months to find all the
bottlenecks, race conditions, resume from failures and so on. So my
recommendation: You can get it work but you need time and you should not
start with critical data, at least if it is not backed by another stable
storage. And you should use 3.10 when it is finally released or build your
own version from trunk. I would not recommend to use < 3.10 for MVs.

Btw.: Our own patched version does some dirty tricks, that may lead to
inconsistencies in some situations but we prefer some possible
inconsistencies (we can deal with) over performance bottlenecks. I created
several tickets to improve MV performance in some streaming situations but
it will take some time to really improve that situation.

Does this answer your question?






-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and 

[jira] [Comment Edited] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894006#comment-15894006
 ] 

Benjamin Roth edited comment on CASSANDRA-12888 at 3/3/17 9:55 AM:
---

I am about to hack a proof of concept for that issue.

Concept:
Each mutation and each partition update have a "repairedAt" flag. This will be 
passed along through the whole write path like MV updates and serialization for 
remote MV updates. Then repair + non-repair mutations have to be separated in 
memtables and flushed to separate SSTables. From what I can see it should be 
easier to maintain a memtable each for repaired and non-repaired data than 
tracking the repair state within a memtable.
Passing repair state to replicas isn't even necessary as replicas should not be 
repaired directly anyway, so no need for a repairedAt state.

My question is:
How important is the exact value of "repairedAt"? Is it possible to merge 
updates with different repair timestamps into a single memtable and finally 
flush them to an SSTable with repairedAt set to the latest or earliest 
repairedAt timestamps of all mutations in the memtable?
Or would that produce repair-inconsistencies or sth?

Any feedback?


was (Author: brstgt):
I am about to hack a proof of concept for that issue.

Concept:
Each mutation and each partition update have a "repairedAt" flag. This will be 
passed along through the whole write path like MV updates and serialization for 
remote MV updates. Then repair + non-repair mutations have to be separated in 
memtables and flushed to separate SSTables. From what I can see it should be 
easier to maintain a memtable each for repaired and non-repaired data than 
tracking the repair state within a memtable.

My question is:
How important is the exact value of "repairedAt"? Is it possible to merge 
updates with different repair timestamps into a single memtable and finally 
flush them to an SSTable with repairedAt set to the latest or earliest 
repairedAt timestamps of all mutations in the memtable?
Or would that produce repair-inconsistencies or sth?

Any feedback?

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-03-03 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894006#comment-15894006
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

I am about to hack a proof of concept for that issue.

Concept:
Each mutation and each partition update have a "repairedAt" flag. This will be 
passed along through the whole write path like MV updates and serialization for 
remote MV updates. Then repair + non-repair mutations have to be separated in 
memtables and flushed to separate SSTables. From what I can see it should be 
easier to maintain a memtable each for repaired and non-repaired data than 
tracking the repair state within a memtable.

My question is:
How important is the exact value of "repairedAt"? Is it possible to merge 
updates with different repair timestamps into a single memtable and finally 
flush them to an SSTable with repairedAt set to the latest or earliest 
repairedAt timestamps of all mutations in the memtable?
Or would that produce repair-inconsistencies or sth?

Any feedback?
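
To illustrate the question, a one-method sketch of the conservative choice: 
taking the minimum never overstates how recently data was repaired; whether 
that (or the maximum) is actually acceptable is exactly what is being asked here.

{code}
// Illustrative only: pick a single repairedAt for the flushed sstable from the
// timestamps of the merged mutations; an unrepaired (0) contribution, or an
// empty set, keeps the sstable marked unrepaired.
final class RepairedAtMerge
{
    static long forFlush(java.util.Collection<Long> repairedAtOfMutations)
    {
        return repairedAtOfMutations.stream()
                                    .mapToLong(Long::longValue)
                                    .min()
                                    .orElse(0L);
    }
}
{code}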

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.11.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side, which are moved to the repaired set. The next repair run will 
> stream the same data back again, causing rows to bounce on and on between 
> nodes on each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13066) Fast streaming with materialized views

2017-03-03 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13066:
--
Fix Version/s: 4.0
   Status: Patch Available  (was: Open)

Depends upon 13064+13065

https://github.com/Jaumo/cassandra/commits/CASSANDRA-13066

> Fast streaming with materialized views
> --
>
> Key: CASSANDRA-13066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13066
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
> Fix For: 4.0
>
>
> I propose adding a configuration option to send streams of tables with MVs 
> not through the regular write path.
> This may be either a global option or better a CF option.
> Background:
> A repair of a CF with an MV that is much out of sync creates many streams. 
> These streams all go through the regular write path to assert local 
> consistency of the MV. This again causes a read before write for every single 
> mutation which again puts a lot of pressure on the node - much more than 
> simply streaming the SSTable down.
> In some cases this can be avoided. Instead of only repairing the base table, 
> all base + mv tables would have to be repaired. But this can break eventual 
> consistency between base table and MV. The proposed behaviour is always safe 
> when having append-only MVs. It also works when using CL_QUORUM writes, but it 
> cannot be absolutely guaranteed that a quorum write is applied atomically, 
> so this can also lead to inconsistencies if a quorum write is started but 
> one node dies in the middle of a request.
> So, this proposal can help a lot in some situations but also can break 
> consistency in others. That's why it should be left upon the operator if that 
> behaviour is appropriate for individual use cases.
> This issue came up here:
> https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CASSANDRA-13293) MV read-before-write can be omitted for some operations

2017-03-02 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13293:
-

 Summary: MV read-before-write can be omitted for some operations
 Key: CASSANDRA-13293
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13293
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benjamin Roth


A view that has the same fields in the primary key as its base table (I call it 
a congruent key) does not require read-before-writes, except for:

- Range deletes
- Partition deletes

If the view uses filters on non-PK columns, either a rbw is required or a write 
that does not match the filter has to be turned into a delete. In doubt I'd 
stay with the current behaviour and do a rbw.
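
A sketch of the resulting decision table (illustrative names only, not a patch):

{code}
// Illustrative only: with a congruent key, most writes can skip the
// read-before-write; range and partition deletes still need it, and non-PK
// filters keep the cautious path, as described above.
final class ViewWritePathPolicy
{
    enum OpKind { ROW_UPDATE, ROW_DELETE, RANGE_DELETE, PARTITION_DELETE }

    static boolean requiresReadBeforeWrite(boolean congruentKey,
                                           boolean viewFiltersOnNonPkColumns,
                                           OpKind op)
    {
        if (!congruentKey)
            return true;                                   // current behaviour
        if (op == OpKind.RANGE_DELETE || op == OpKind.PARTITION_DELETE)
            return true;                                   // must resolve affected view rows
        return viewFiltersOnNonPkColumns;                  // "in doubt, keep the rbw"
    }
}
{code}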



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13290) Optimizing very small repair streams

2017-03-02 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892981#comment-15892981
 ] 

Benjamin Roth commented on CASSANDRA-13290:
---

CASSANDRA-8911 brings a very interesting approach into the game; however, 
the solution is rather complex (as can be seen in the stalled ticket activity).

I guess both 12888 and this ticket are lower-hanging fruit for a start, 
though I'm not saying it isn't worth working on both approaches.

> Optimizing very small repair streams
> 
>
> Key: CASSANDRA-13290
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13290
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>
> I often encountered repair scenarios where a lot of tiny repair streams were 
> created. This results in hundreds, thousands or even tens of thousands of 
> super small SSTables (some bytes to some kbytes).
> This puts a lot of pressure on compaction and may even lead to a crash due to 
> too many open files - I also encountered this.
> What could help to avoid this:
> After CASSANDRA-12888 is resolved, a tiny stream (e.g. < 100kb) could be sent 
> through the write path to be buffered by memtables instead of creating an 
> SSTable each.
> Without CASSANDRA-12888 this would break incremental repairs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CASSANDRA-13290) Optimizing very small repair streams

2017-03-02 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13290:
-

 Summary: Optimizing very small repair streams
 Key: CASSANDRA-13290
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13290
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benjamin Roth


I often encountered repair scenarios where a lot of tiny repair streams were 
created. This results in hundreds, thousands or even tens of thousands of super 
small SSTables (some bytes to some kbytes).
This puts a lot of pressure on compaction and may even lead to a crash due to 
too many open files - I also encountered this.

What could help to avoid this:
After CASSANDRA-12888 is resolved, a tiny stream (e.g. < 100kb) could be sent 
through the write path to be buffered by memtables instead of creating an 
SSTable each.

Without CASSANDRA-12888 this would break incremental repairs.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (CASSANDRA-12985) Update MV repair documentation

2017-03-02 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth resolved CASSANDRA-12985.
---
Resolution: Resolved

MV repairs won't be changed as proposed

> Update MV repair documentation
> --
>
> Key: CASSANDRA-12985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12985
> Project: Cassandra
>  Issue Type: Task
>Reporter: Benjamin Roth
> Fix For: 3.0.x, 3.11.x
>
>
> Due to CASSANDRA-12888 the way MVs are being repaired changes.
> Before:
> MV has been repaired by repairing the base table. Repairing the MV separately 
> has been discouraged. Also repairing a whole KS containing a MV has been 
> discouraged.
> After:
> MVs are treated like any other table in repairs. They also MUST be repaired 
> as any other table. Base table does NOT repair MV any more.
> Repairing a whole keyspace is encouraged.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-03-02 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892143#comment-15892143
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

So... who's gonna do it?

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size lead to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering chunksize (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, rather 10-15MB/s.
> The risk of (physical) overreads is increasing with lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists rather of small rows or small resultsets, the read 
> overhead with 64kb chunk size is insanely high. This applies for example for 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-03-02 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892040#comment-15892040
 ] 

Benjamin Roth edited comment on CASSANDRA-13065 at 3/2/17 12:06 PM:


[~pauloricardomg] Please also look at follow-up commits on review. I added 2 
more commits (for 13064+13065) with tiny fixes. Depending on your feedback, I 
can rearrange them if required.

https://github.com/Jaumo/cassandra/tree/CASSANDRA-13064


was (Author: brstgt):
[~pauloricardomg] Please also look at follow-up commits on review. I added 2 
more commits (13064+13065) with tiny fixes. Depending on your feedback, I can 
rearrange them if required.

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write path. This causes read-before-writes for every 
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064
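
The routing the patch argues for boils down to something like this sketch 
(hypothetical names; the actual change hooks into the stream receive path):

{code}
// Illustrative only: streams for tables with MVs take the regular (slow) write
// path, except during consistent range movements such as bootstrap, where the
// paired view replica moves along with the base range and direct sstable
// streaming does not break the MV local consistency contract.
final class MvStreamRouting
{
    static boolean useRegularWritePath(boolean tableHasViews, boolean isConsistentRangeMovement)
    {
        if (!tableHasViews)
            return false;                    // plain tables stream sstables directly
        return !isConsistentRangeMovement;   // repairs etc. keep the safe path
    }
}
{code}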



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-03-02 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892040#comment-15892040
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

[~pauloricardomg] Please also look at follow-up commits on review. I added 2 
more commits (13064+13065) with tiny fixes. Depending on your feedback, I can 
rearrange them if required.

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write path. This causes read-before-writes for every 
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13279) Table default settings file

2017-03-01 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890143#comment-15890143
 ] 

Benjamin Roth commented on CASSANDRA-13279:
---

I think [~slebresne] is basically right and I think transparency for users 
(meaning better docs at a central place) + better defaults are actually more 
important than convenience. If convenience introduces potential problems, then 
it's not really worth it.
I am also ok with closing it.

> Table default settings file
> ---
>
> Key: CASSANDRA-13279
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13279
> Project: Cassandra
>  Issue Type: Wish
>  Components: Configuration
>Reporter: Romain Hardouin
>Priority: Minor
>  Labels: config, documentation
>
> Following CASSANDRA-13241 we often see that there is no one-size-fits-all 
> value for settings. We can't find a sweet spot for every use cases.
> It's true for settings in cassandra.yaml but as [~brstgt] said for 
> {{chunk_length_in_kb}}: "this is somewhat hidden for the average user". 
> Many table settings are somewhat hidden for the average user. Some people 
> will think RTFM but if a file - say tables.yaml - contains default values for 
> table settings, more people would pay attention to them. And of course this 
> file could contain useful comments and guidance. 
> Example with SSTable compression options:
> {code}
> # General comments about sstable compression
> compression:
> # First of all: explain what it is. We split each SSTable into chunks, 
> etc.
> # Explain when users should lower this value (e.g. 4) or when higher 
> values like 64 or 128 are recommended.
> # Explain the trade-off between read latency and off-heap compression 
> metadata size.
> chunk_length_in_kb: 16
> 
> # List of available compressors: LZ4Compressor, SnappyCompressor, and 
> DeflateCompressor
> # Explain trade-offs, some specific use cases (e.g. archives), etc.
> class: 'LZ4Compressor'
> 
> # If you want to disable compression by default, uncomment the following 
> line
> #enabled: false
> {code}
> So instead of hard coded values we would end up with something like 
> TableConfig + TableDescriptor à la Config + DatabaseDescriptor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (CASSANDRA-13066) Fast streaming with materialized views

2017-03-01 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth reassigned CASSANDRA-13066:
-

Assignee: Benjamin Roth

> Fast streaming with materialized views
> --
>
> Key: CASSANDRA-13066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13066
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>
> I propose adding a configuration option to send streams of tables with MVs 
> not through the regular write path.
> This may be either a global option or better a CF option.
> Background:
> A repair of a CF with an MV that is much out of sync creates many streams. 
> These streams all go through the regular write path to assert local 
> consistency of the MV. This again causes a read before write for every single 
> mutation which again puts a lot of pressure on the node - much more than 
> simply streaming the SSTable down.
> In some cases this can be avoided. Instead of only repairing the base table, 
> all base + mv tables would have to be repaired. But this can break eventual 
> consistency between base table and MV. The proposed behaviour is always safe 
> when having append-only MVs. It also works when using CL_QUORUM writes, but it 
> cannot be absolutely guaranteed that a quorum write is applied atomically, 
> so this can also lead to inconsistencies if a quorum write is started but 
> one node dies in the middle of a request.
> So, this proposal can help a lot in some situations but can also break 
> consistency in others. That's why it should be left up to the operator whether that 
> behaviour is appropriate for individual use cases.
> This issue came up here:
> https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13066) Fast streaming with materialized views

2017-03-01 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13066:
--
Summary: Fast streaming with materialized views  (was: Fast repair with 
materialized views)

> Fast streaming with materialized views
> --
>
> Key: CASSANDRA-13066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13066
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>
> I propose adding a configuration option to send streams of tables with MVs 
> not through the regular write path.
> This may be either a global option or better a CF option.
> Background:
> A repair of a CF with an MV that is much out of sync creates many streams. 
> These streams all go through the regular write path to assert local 
> consistency of the MV. This again causes a read before write for every single 
> mutation which again puts a lot of pressure on the node - much more than 
> simply streaming the SSTable down.
> In some cases this can be avoided. Instead of only repairing the base table, 
> all base + mv tables would have to be repaired. But this can break eventual 
> consistency between base table and MV. The proposed behaviour is always safe 
> when having append-only MVs. It also works when using CL_QUORUM writes, but it 
> cannot be absolutely guaranteed that a quorum write is applied atomically, 
> so this can also lead to inconsistencies if a quorum write is started but 
> one node dies in the middle of a request.
> So, this proposal can help a lot in some situations but can also break 
> consistency in others. That's why it should be left up to the operator whether that 
> behaviour is appropriate for individual use cases.
> This issue came up here:
> https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-02-28 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13065:
--
Fix Version/s: 4.0

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write paths. This causes read-before-writes for every 
> mutation and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889675#comment-15889675
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

[~pauloricardomg] This is the follow-up to CASSANDRA-13064.

I also optimized behaviour for CDC if no writepath is required due to MVs. This 
will allow incremental repairs for CFs with CDC without MVs.

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 4.0
>
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write paths. This causes read-before-writes for every 
> mutation and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-02-28 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13065:
--
Status: Patch Available  (was: Open)

https://github.com/Jaumo/cassandra/commit/95a215e4f9c46e62580dcd4f638c80d3cf9716db

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write paths. This causes read-before-writes for every 
> mutation and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889672#comment-15889672
 ] 

Benjamin Roth commented on CASSANDRA-13064:
---

[~pauloricardomg] Would you like to take a look at my patch? For a start I only 
replaced the stream descriptions with a discrete enum. It's the easiest refactoring 
that does not break compatibility with the existing serialization.

If you want, you can also take a look at the next commit, which belongs to 
CASSANDRA-13065.
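
Just to illustrate what I mean by a discrete enum - a rough sketch only, not the 
actual patch; the names and especially the per-purpose flags are exactly what is up 
for discussion here:

{code}
public enum StreamPurpose
{
    BOOTSTRAP,
    DECOMMISSION,
    REPAIR,
    REPLACE_NODE,
    REMOVE_NODE,
    RANGE_RELOCATION;

    // Illustrative hints only - which purposes really need a flush or the MV write
    // path is what this ticket and CASSANDRA-13065 are meant to settle.
    public boolean requiresFlush()
    {
        return this != REPAIR; // repair data was already flushed by validation compaction
    }

    public boolean requiresWritePathForMaterializedViews()
    {
        return this == REPAIR; // bootstrap/decommission could stream sstables down directly
    }
}
{code}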

> Add stream type or purpose to stream plan / stream
> --
>
> Key: CASSANDRA-13064
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13064
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
> Fix For: 4.0
>
>
> It would be very good to know the type or purpose of a certain stream on the 
> receiver side. It should be both available in a stream request and a stream 
> task.
> Why?
> It would be helpful to distinguish the purpose to allow different handling of 
> streams and requests. Examples:
> - In a stream request a global flush is done. This is not necessary for all 
> types of streams. A repair stream(-plan) does not require a flush, as this has 
> been done shortly before in validation compaction and only the sstables that 
> have been validated have to be streamed.
> - In StreamReceiveTask, streams for MVs go through the regular write path; this 
> is painfully slow, especially on bootstrap and decommission. For both bootstrap 
> and decommission this is not necessary: sstables can be streamed directly 
> down in this case. Handling bootstrap is no problem as it relies on a local 
> state, but during decommission the decom-state is bound to the sender and not 
> the receiver, so the receiver has to know that it is safe to stream that 
> sstable directly, not through the write path. That's why we have to know the 
> purpose of the stream.
> I'd love to implement this on my own but I am not sure how to avoid breaking the 
> streaming protocol for backwards compat, or if it is ok to do so.
> Furthermore I'd love to get some feedback on that idea and some proposals for 
> what stream types to distinguish. I could imagine:
> - bootstrap
> - decommission
> - repair
> - replace node
> - remove node
> - range relocation
> Comments like this support my idea; knowing the purpose could avoid this:
> {quote}
> // TODO each call to transferRanges re-flushes, this is 
> potentially a lot of waste
> streamPlan.transferRanges(newEndpoint, preferred, 
> keyspaceName, ranges);
> {quote}
> An alternative to passing the purpose of the stream would be to pass flags like:
> - requiresFlush
> - requiresWritePathForMaterializedView
> ...
> I guess passing the purpose will make the streaming protocol more robust for 
> future changes and leaves decisions up to the receiver.
> But an additional "requiresFlush" would also avoid putting too much logic 
> into the streaming code. The streaming code should not care about purposes; 
> the caller or receiver should. So the decision whether a stream requires a flush 
> before streaming should be up to the stream requester and the stream request 
> receiver, depending on the purpose of the stream.
> I'm excited about your feedback :)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream

2017-02-28 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13064:
--
Fix Version/s: 4.0
   Status: Patch Available  (was: Open)

https://github.com/Jaumo/cassandra/commit/4189c949336f3c7e4ba25da80fdd7da5faa2ea65

> Add stream type or purpose to stream plan / stream
> --
>
> Key: CASSANDRA-13064
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13064
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
> Fix For: 4.0
>
>
> It would be very good to know the type or purpose of a certain stream on the 
> receiver side. It should be both available in a stream request and a stream 
> task.
> Why?
> It would be helpful to distinguish the purpose to allow different handling of 
> streams and requests. Examples:
> - In a stream request a global flush is done. This is not necessary for all 
> types of streams. A repair stream(-plan) does not require a flush, as this has 
> been done shortly before in validation compaction and only the sstables that 
> have been validated have to be streamed.
> - In StreamReceiveTask, streams for MVs go through the regular write path; this 
> is painfully slow, especially on bootstrap and decommission. For both bootstrap 
> and decommission this is not necessary: sstables can be streamed directly 
> down in this case. Handling bootstrap is no problem as it relies on a local 
> state, but during decommission the decom-state is bound to the sender and not 
> the receiver, so the receiver has to know that it is safe to stream that 
> sstable directly, not through the write path. That's why we have to know the 
> purpose of the stream.
> I'd love to implement this on my own but I am not sure how to avoid breaking the 
> streaming protocol for backwards compat, or if it is ok to do so.
> Furthermore I'd love to get some feedback on that idea and some proposals for 
> what stream types to distinguish. I could imagine:
> - bootstrap
> - decommission
> - repair
> - replace node
> - remove node
> - range relocation
> Comments like this support my idea; knowing the purpose could avoid this:
> {quote}
> // TODO each call to transferRanges re-flushes, this is 
> potentially a lot of waste
> streamPlan.transferRanges(newEndpoint, preferred, 
> keyspaceName, ranges);
> {quote}
> An alternative to passing the purpose of the stream would be to pass flags like:
> - requiresFlush
> - requiresWritePathForMaterializedView
> ...
> I guess passing the purpose will make the streaming protocol more robust for 
> future changes and leaves decisions up to the receiver.
> But an additional "requiresFlush" would also avoid putting too much logic 
> into the streaming code. The streaming code should not care about purposes; 
> the caller or receiver should. So the decision whether a stream requires a flush 
> before streaming should be up to the stream requester and the stream request 
> receiver, depending on the purpose of the stream.
> I'm excited about your feedback :)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889653#comment-15889653
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

I thought of 2 arrays because the semantic meaning (position vs chunk size) and a 
single alignment (8, 3, 2 bytes) for each could be easier to understand and to 
maintain. Of course it works either way. With 2 arrays you could still "pull 
sections", it's just one more fetch to get the 8 byte absolute offset.
Loop summing vs. "relative-absolute offset": In the end this is always a 
tradeoff between memory and CPU. I personally am not the one who fights for every 
single byte in this case. But I also think a few more CPU cycles to sum a bunch 
of ints are still bearable. I guess if I had to decide, I'd give "loop summing" 
a try. Any different opinions?

Do you mean a ChunkCache cache miss? Sorry for that kind of question. I never 
came across this part of the code.

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888911#comment-15888911
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

How about this:

You create 2 chunk lookup tables. One with absolute pointers (long, 8 byte).
A second one with relative pointers or chunk-sizes - 2 bytes are enough for up 
to 64kb chunks.
You store an absolute pointer for every $x chunks (1000 in this example).
So you can get the base offset by looking up the absolute table at $idx = 
($pos - ($pos % $x)) / $x.
Then you iterate through the size lookup from ($pos - ($pos % $x)) to $pos - 1.
A fallback can be provided for chunks >64kb. Either relative pointers are 
completely avoided or are increased to 3 bytes.

There you go.
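
In (pseudo) Java the lookup could look roughly like this - just a sketch of the 
idea, names and the sampling interval are made up, not meant as the actual 
implementation:

{code}
public class SampledChunkOffsets
{
    private static final int SAMPLE = 1000;   // one absolute pointer every 1000 chunks

    private final long[] absoluteOffsets;     // absoluteOffsets[i] = file offset of chunk i * SAMPLE
    private final short[] chunkSizes;         // 2 bytes per chunk, read as unsigned (chunks < 64kb)

    public SampledChunkOffsets(long[] absoluteOffsets, short[] chunkSizes)
    {
        this.absoluteOffsets = absoluteOffsets;
        this.chunkSizes = chunkSizes;
    }

    // "Loop summing": start from the nearest sampled absolute offset and add up
    // the chunk sizes until the requested chunk is reached.
    public long chunkOffset(int pos)
    {
        int base = pos - (pos % SAMPLE);
        long offset = absoluteOffsets[base / SAMPLE];
        for (int i = base; i < pos; i++)
            offset += chunkSizes[i] & 0xFFFF;  // unsigned 2-byte chunk size
        return offset;
    }
}
{code}

Memory-wise that is 8 bytes per 1000 chunks plus 2 bytes per chunk instead of 8 
bytes per chunk.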

Payload of 1 TB = 1024 * 1024 * 1024kb

CS 64 (NOW):

chunks = 1024 * 1024 * 1024kb / 64kb = 16777216 (16M)
compression = 1.99
compressed_size = 1024 * 1024 * 1024kb / 1.99 = 539568756kb
kernel_pages = 134892189
absolute_pointer_size = 8 * chunks = 134217728 (128MB)
kernel_page_size = 134892189 * 8 (1029 MB)
total_size = 1157MB

CS 4 with relative positions

chunks = 1024 * 1024 * 1024kb / 4kb = 268435456 (256M)
compression = 1.75
compressed_size = 1024 * 1024 * 1024kb / 1.75 = 613566757kb
kernel_pages = 153391689
absolute_pointer_size = 8 * chunks / 1000 = 2147484 (2 MB)
relative_pointer_size = 2 * chunks = 536870912 (512 MB)
kernel_page_size = 153391689 * 8 = 1227133512 (1170MB)
total_size = 1684MB

increase = 45%

=> This reduces the memory overhead of going from 64kb to 4kb chunks from the 
initially mentioned 800% to 45%, once you also take the kernel structs into 
account - which are of a relevant size themselves, even more than the initially 
discussed "128M" for 64kb chunks.

Pro:
A lot less memory required

Con:
Some CPU overhead. But is this really relevant compared to decompressing 4kb or 
even 64kb?

P.S.: Kernel memory calculation is based on the 8 bytes [~aweisberg] has 
researched. Compression ratios are taken from the percona blog.

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream

2017-02-28 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth reassigned CASSANDRA-13064:
-

Assignee: Benjamin Roth

> Add stream type or purpose to stream plan / stream
> --
>
> Key: CASSANDRA-13064
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13064
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>
> It would be very good to know the type or purpose of a certain stream on the 
> receiver side. It should be both available in a stream request and a stream 
> task.
> Why?
> It would be helpful to distinguish the purpose to allow different handling of 
> streams and requests. Examples:
> - In a stream request a global flush is done. This is not necessary for all 
> types of streams. A repair stream(-plan) does not require a flush, as this has 
> been done shortly before in validation compaction and only the sstables that 
> have been validated have to be streamed.
> - In StreamReceiveTask, streams for MVs go through the regular write path; this 
> is painfully slow, especially on bootstrap and decommission. For both bootstrap 
> and decommission this is not necessary: sstables can be streamed directly 
> down in this case. Handling bootstrap is no problem as it relies on a local 
> state, but during decommission the decom-state is bound to the sender and not 
> the receiver, so the receiver has to know that it is safe to stream that 
> sstable directly, not through the write path. That's why we have to know the 
> purpose of the stream.
> I'd love to implement this on my own but I am not sure how to avoid breaking the 
> streaming protocol for backwards compat, or if it is ok to do so.
> Furthermore I'd love to get some feedback on that idea and some proposals for 
> what stream types to distinguish. I could imagine:
> - bootstrap
> - decommission
> - repair
> - replace node
> - remove node
> - range relocation
> Comments like this support my idea; knowing the purpose could avoid this:
> {quote}
> // TODO each call to transferRanges re-flushes, this is 
> potentially a lot of waste
> streamPlan.transferRanges(newEndpoint, preferred, 
> keyspaceName, ranges);
> {quote}
> An alternative to passing the purpose of the stream would be to pass flags like:
> - requiresFlush
> - requiresWritePathForMaterializedView
> ...
> I guess passing the purpose will make the streaming protocol more robust for 
> future changes and leaves decisions up to the receiver.
> But an additional "requiresFlush" would also avoid putting too much logic 
> into the streaming code. The streaming code should not care about purposes; 
> the caller or receiver should. So the decision whether a stream requires a flush 
> before streaming should be up to the stream requester and the stream request 
> receiver, depending on the purpose of the stream.
> I'm excited about your feedback :)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13279) Table default settings file

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888112#comment-15888112
 ] 

Benjamin Roth commented on CASSANDRA-13279:
---

Maybe it was a bit misleading. I am not defending a new source per se. I am 
simply 'pro' improving docs by adding problem/solution centric resources in a 
place that can easily be found by anyone. E.g. if I google for "Cassandra 
performance tuning", the first match should go to an official guide. 

I'd love to volunteer, but first I'd like to work on MVs, which I have been 
deferring since the end of '16.
But if there is a consensus on a possible structure and I have access to the 
docs, I am happy to add content whenever I feel like it.

> Table default settings file
> ---
>
> Key: CASSANDRA-13279
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13279
> Project: Cassandra
>  Issue Type: Wish
>  Components: Configuration
>Reporter: Romain Hardouin
>Priority: Minor
>  Labels: config, documentation
>
> Following CASSANDRA-13241 we often see that there is no one-size-fits-all 
> value for settings. We can't find a sweet spot for every use case.
> It's true for settings in cassandra.yaml but as [~brstgt] said for 
> {{chunk_length_in_kb}}: "this is somewhat hidden for the average user". 
> Many table settings are somewhat hidden for the average user. Some people 
> will think RTFM but if a file - say tables.yaml - contains default values for 
> table settings, more people would pay attention to them. And of course this 
> file could contain useful comments and guidance. 
> Example with SSTable compression options:
> {code}
> # General comments about sstable compression
> compression:
> # First of all: explain what it is. We split each SSTable into chunks, 
> etc.
> # Explain when users should lower this value (e.g. 4) or when higher 
> values like 64 or 128 are recommended.
> # Explain the trade-off between read latency and off-heap compression 
> metadata size.
> chunk_length_in_kb: 16
> 
> # List of available compressors: LZ4Compressor, SnappyCompressor, and 
> DeflateCompressor
> # Explain trade-offs, some specific use cases (e.g. archives), etc.
> class: 'LZ4Compressor'
> 
> # If you want to disable compression by default, uncomment the following 
> line
> #enabled: false
> {code}
> So instead of hard coded values we would end up with something like 
> TableConfig + TableDescriptor à la Config + DatabaseDescriptor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2017-02-28 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth reassigned CASSANDRA-13065:
-

Assignee: Benjamin Roth

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Assignee: Benjamin Roth
>Priority: Critical
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write paths. This causes read-before-writes for every 
> mutation and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13279) Table default settings file

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888084#comment-15888084
 ] 

Benjamin Roth commented on CASSANDRA-13279:
---

I can understand your consideration about the deployment issues of centralized 
settings in a non-centralized settings file.

But I have to contradict on the second point. By "somewhat hidden" I don't mean it 
does not exist, but that an average user won't come across the documentation or the 
valuable information (why should I tweak that?) that is related to it.
It is very difficult to find the right resource / doc in the CS ecosystem. 
There is DataStax, there is the official CS site (which contains a lot of TODOs 
and empty pages), wiki.apache.org (looks very outdated) and there are zillions 
of distributed and spread-out resources like blogs all over the net. Finding the 
right information (as a new user) is the famous needle in the haystack. You are 
a user / developer from the early days and know every corner of the CS universe, 
but for new users it is hard to get an overview of and 'somewhat hidden'.

To be honest:
When I first installed and tested CS, I was totally lost. I had to test a lot, 
read many, many different resources, go through the hell of trial and 
error, analyzing, debugging, compiling and testing again with a lot of pain to 
get the knowledge I have today. Tweaking chunk_size was quite the same. I 
tried a lot of stuff, posted on lists, ... and after some days I was like 
"Wait, there was this setting in DevCenter with that 'chunk_size', what does it 
exactly do and what happens if ... AH it works!".

How about creating a structure in the official Cassandra docs with use cases 
and Q&A for performance tuning?
Something like a structured version of Al Tobey's tuning guide with a Problem > 
Solution section.


> Table default settings file
> ---
>
> Key: CASSANDRA-13279
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13279
> Project: Cassandra
>  Issue Type: Wish
>  Components: Configuration
>Reporter: Romain Hardouin
>Priority: Minor
>  Labels: config, documentation
>
> Following CASSANDRA-13241 we often see that there is no one-size-fits-all 
> value for settings. We can't find a sweet spot for every use case.
> It's true for settings in cassandra.yaml but as [~brstgt] said for 
> {{chunk_length_in_kb}}: "this is somewhat hidden for the average user". 
> Many table settings are somewhat hidden for the average user. Some people 
> will think RTFM but if a file - say tables.yaml - contains default values for 
> table settings, more people would pay attention to them. And of course this 
> file could contain useful comments and guidance. 
> Example with SSTable compression options:
> {code}
> # General comments about sstable compression
> compression:
> # First of all: explain what it is. We split each SSTable into chunks, 
> etc.
> # Explain when users should lower this value (e.g. 4) or when higher 
> values like 64 or 128 are recommended.
> # Explain the trade-off between read latency and off-heap compression 
> metadata size.
> chunk_length_in_kb: 16
> 
> # List of available compressors: LZ4Compressor, SnappyCompressor, and 
> DeflateCompressor
> # Explain trade-offs, some specific use cases (e.g. archives), etc.
> class: 'LZ4Compressor'
> 
> # If you want to disable compression by default, uncomment the following 
> line
> #enabled: false
> {code}
> So instead of hard coded values we would end up with something like 
> TableConfig + TableDescriptor à la Config + DatabaseDescriptor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887961#comment-15887961
 ] 

Benjamin Roth edited comment on CASSANDRA-13226 at 2/28/17 1:12 PM:


Sorry for so many comments, just another thought:

Flushes can be optimized very easily so that a flush is only executed 
if the memtable contains mutations for the requested range OR if the memtable 
exceeds a certain size, so that the check is still cheap. I implemented this 
just for fun some months ago but never created a ticket for it.

See patch here 
https://github.com/Jaumo/cassandra/commit/983514b0d3e15cea042533273ead5ea33c00bacf

Just saw it also disabled pre-repair flush as proposed before.


was (Author: brstgt):
Sorry for that many comments, just another thought:

Flushes can be optimized very easily in that way that a flush is only executed 
if the memtable contains mutations for the requested range OR if the memtable 
exceeds a certain size, so that the check is still cheap. I implemented this 
just for fun some months ago but did never create a ticket for it.

> StreamPlan for incremental repairs flushing memtables unnecessarily
> ---
>
> Key: CASSANDRA-13226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13226
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Minor
> Fix For: 4.0
>
>
> Since incremental repairs are run against a fixed dataset, there's no need to 
> flush memtables when streaming for them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887961#comment-15887961
 ] 

Benjamin Roth commented on CASSANDRA-13226:
---

Sorry for so many comments, just another thought:

Flushes can be optimized very easily so that a flush is only executed 
if the memtable contains mutations for the requested range OR if the memtable 
exceeds a certain size, so that the check is still cheap. I implemented this 
just for fun some months ago but never created a ticket for it.
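
The idea is roughly the following - an illustrative sketch, not the patch linked 
above; MemtableView and TokenRange are just stand-ins for the real classes:

{code}
interface TokenRange {}

interface MemtableView
{
    long liveDataSize();                        // bytes currently held in the memtable
    boolean containsDataFor(TokenRange range);  // cheap check: any partition in this range?
}

final class StreamFlushPolicy
{
    static boolean flushRequired(MemtableView memtable, Iterable<TokenRange> ranges, long sizeThresholdBytes)
    {
        if (memtable.liveDataSize() > sizeThresholdBytes)
            return true;                        // large memtable: range check not worth it, just flush
        for (TokenRange range : ranges)
            if (memtable.containsDataFor(range))
                return true;                    // relevant mutations still in memory, flush before streaming
        return false;                           // nothing relevant in memory, the sstables alone are enough
    }
}
{code}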

> StreamPlan for incremental repairs flushing memtables unnecessarily
> ---
>
> Key: CASSANDRA-13226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13226
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Minor
> Fix For: 4.0
>
>
> Since incremental repairs are run against a fixed dataset, there's no need to 
> flush memtables when streaming for them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887945#comment-15887945
 ] 

Benjamin Roth commented on CASSANDRA-13226:
---

I am referring to this "stacktrace":

RepairMessageVerbHandler.doVerb (case VALIDATION_REQUEST)
CompactionManager.instance.submitValidation(store, validator) 
CompactionManager.doValidationCompaction
=> StorageService.instance.forceKeyspaceFlush

After that, merkle trees are calculated and based on them streams are triggered. 
That's why all data that is eligible for transfer has already been flushed.

Also, avoiding a flush locally is only half the story. Streams REQUESTED by a 
stream plan also cause a flush on the sender side. But that sender has also 
already validated (and so flushed) the requested data.

Maybe I missed something, but from what I can see, a REPAIR stream never requires a 
flush.

> StreamPlan for incremental repairs flushing memtables unnecessarily
> ---
>
> Key: CASSANDRA-13226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13226
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Minor
> Fix For: 4.0
>
>
> Since incremental repairs are run against a fixed dataset, there's no need to 
> flush memtables when streaming for them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13279) Table default settings file

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887733#comment-15887733
 ] 

Benjamin Roth commented on CASSANDRA-13279:
---

Great idea!

+1

> Table default settings file
> ---
>
> Key: CASSANDRA-13279
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13279
> Project: Cassandra
>  Issue Type: Wish
>  Components: Configuration
>Reporter: Romain Hardouin
>Priority: Minor
>  Labels: config, documentation
>
> Following CASSANDRA-13241 we often see that there is no one-size-fits-all 
> value for settings. We can't find a sweet spot for every use case.
> It's true for settings in cassandra.yaml but as [~brstgt] said for 
> {{chunk_length_in_kb}}: "this is somewhat hidden for the average user". 
> Many table settings are somewhat hidden for the average user. Some people 
> will think RTFM but if a file - say tables.yaml - contains default values for 
> table settings, more people would pay attention to them. And of course this 
> file could contain useful comments and guidance. 
> Example with SSTable compression options:
> {code}
> # General comments about sstable compression
> compression:
> # First of all: explain what it is. We split each SSTable into chunks, 
> etc.
> # Explain when users should lower this value (e.g. 4) or when higher 
> values like 64 or 128 are recommended.
> # Explain the trade-off between read latency and off-heap compression 
> metadata size.
> chunk_length_in_kb: 16
> 
> # List of available compressors: LZ4Compressor, SnappyCompressor, and 
> DeflateCompressor
> # Explain trade-offs, some specific use cases (e.g. archives), etc.
> class: 'LZ4Compressor'
> 
> # If you want to disable compression by default, uncomment the following 
> line
> #enabled: false
> {code}
> So instead of hard coded values we would end up with something like 
> TableConfig + TableDescriptor à la Config + DatabaseDescriptor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13226) StreamPlan for incremental repairs flushing memtables unnecessarily

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887654#comment-15887654
 ] 

Benjamin Roth commented on CASSANDRA-13226:
---

Isn't this also true for non-incremental repairs?
Merkle tree calculation also triggers a flush and any repair begins with a 
merkle tree. So there is no need to flush as the inconsistent dataset to be 
streamed for repair is always contained in SSTables flushed by MT calculation 
before.

> StreamPlan for incremental repairs flushing memtables unnecessarily
> ---
>
> Key: CASSANDRA-13226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13226
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Minor
> Fix For: 4.0
>
>
> Since incremental repairs are run against a fixed dataset, there's no need to 
> flush memtables when streaming for them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887602#comment-15887602
 ] 

Benjamin Roth edited comment on CASSANDRA-13241 at 2/28/17 8:58 AM:


Just thinking about Jeff's + Ben's comments:

Even if you have 4 TB of data and 32GB RAM, 4KB might help. In that (extreme) 
case, you'd steal ~8GB from the page cache for "chunk tables". These 8GB probably 
would have helped next to nothing when used as page cache if you look at 
the RAM/load ratio. Most probably the PC would be totally ineffective unless you 
have a very, very low percentage of hot data. So the probability that 
nearly every read results in a physical IO is very high.
So in that case lowering the chunk size to 4KB would at least save you from 
immense overread and help the SSDs to survive that situation.

That said, I see only one REAL problem:
If you have more chunk-offset data than fits in your memory. But in that case 
my answer would simply be: get more RAM. There are certain minimum requirements 
you MUST fulfill. The idea of running a node with many TBs of data with 
less than, say, 16-32GB is simply insane from any perspective.

Nevertheless, optimizing the memory usage of the chunk-offset lookup would still 
be a big win.


was (Author: brstgt):
Just thinking about Jeffs + Bens comments:

Even if you have 4 TB of data and 32GB RAM 4KB might help. In that (extreme) 
case, you'd steal ~8GB from page cache for "chunk tables". These 8GB probably 
would have helped a fraction of nothing when used as page cache if you look at 
the RAM/Load ratio. Most probably the PC would be totally ineffective, if you 
don't have a very, very low percentage of hot data. So the probability that 
nearly every read results in a physical IO is very high.
So in that case lowering the chunk size to 4KB would at least save you from 
immense overread and help the SSDs to survive that situation.

That said, I see only one REAL problem:
If you have more chunk-offset data than fits in your memory. But in that case 
my answer would simply be: Get more RAM. There are certain mininum requirements 
you MUST fulfill. The imagination of running a node with many TBs of data with 
less than say 16-32GB is simply insane from all kinds of perspective.

Nevertheless optimizing the memory usage of chunk-offset lookup would be a big 
deal either.

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-28 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887602#comment-15887602
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

Just thinking about Jeff's + Ben's comments:

Even if you have 4 TB of data and 32GB RAM, 4KB might help. In that (extreme) 
case, you'd steal ~8GB from the page cache for "chunk tables". These 8GB probably 
would have helped next to nothing when used as page cache if you look at 
the RAM/load ratio. Most probably the PC would be totally ineffective unless you 
have a very, very low percentage of hot data. So the probability that 
nearly every read results in a physical IO is very high.
So in that case lowering the chunk size to 4KB would at least save you from 
immense overread and help the SSDs to survive that situation.

That said, I see only one REAL problem:
If you have more chunk-offset data than fits in your memory. But in that case 
my answer would simply be: get more RAM. There are certain minimum requirements 
you MUST fulfill. The idea of running a node with many TBs of data with 
less than, say, 16-32GB is simply insane from any perspective.

Nevertheless, optimizing the memory usage of the chunk-offset lookup would still 
be a big win.
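
For reference, the rough arithmetic behind the ~8GB figure (assuming the current 
8 bytes of off-heap offset data per chunk and ignoring the compression of the data 
itself):

{code}
chunks  = 4 TB / 4 KB       = 1073741824 chunks
offsets = chunks * 8 bytes  = 8589934592 bytes ~= 8 GB
{code}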

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-27 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887451#comment-15887451
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

[~aweisberg] I didn't really get the point of your comment. Would you like to 
explain?

[~jjirsa]
I understand your consideration. A default value should avoid worst cases for 
most or all people and not optimize for one case. So maybe yes, we could choose 
something in between.
Do you see a way to offer a recommendation to users similar to the comments in 
cassandra.yaml? IMHO this table option is somewhat hidden for the average user 
but may have a huge impact on your overall server load and your latency.

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-27 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13241:
--

Hm. I read recommendations that a single node should not have a load of
more than 1-2 TB per node. And I read recommendations of having at least
128gb RAM. If I pay 2gb for a recommended max load to get MUCH better
performance on uncached IO, which makes up more than 80% of the recommended sizing
(on equally hot data), it seems a quite fair price to me.
If there is much less hot data it probably still works, as you only trade
page cache for faster IO. The less hot data, the less page cache is
required.

Did I miss a point?

Btw 4kb worked perfectly for me with 460gb load/128gb RAM. 64kb did not
work well. Really.




> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-27 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886536#comment-15886536
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

According to percona 
(https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/)
 and my own experience, the impact on compression ratio is not that big with 
lz4.

Can the increased offheap requirements be expressed in a formula?
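My rough guess - assuming the off-heap metadata is essentially one 8-byte offset 
per chunk - would be something like:

{code}
offheap_chunk_metadata ~= (compressed data size / chunk length) * 8 bytes

e.g. 460 GB of data at 4 KB chunks  -> ~920 MB
     460 GB of data at 64 KB chunks -> ~57 MB
{code}

But please correct me if that is missing something.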

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-27 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886517#comment-15886517
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

No worries. Your patch answered my questions implicitly. Thanks!

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with read 
> ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small resultsets, the read 
> overhead with a 64kb chunk size is insanely high. This applies for example to 
> (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows, that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-27 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886402#comment-15886402
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

Thanks for your vote, but ... maybe this is a stupid question: who will finally 
decide whether that change is accepted?
I think I could make a patch pretty easily, but how does change management work?

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having too low a chunk size may result in some wasted disk space. Too high a 
> chunk size may lead to massive overreads and can have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with 
> read ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small result sets, the read 
> overhead with a 64kb chunk size is insanely high. This applies, for example, 
> to (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insight into what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-22 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878185#comment-15878185
 ] 

Benjamin Roth commented on CASSANDRA-13241:
---

Thanks for your comment. Of course there is no perfect match for all cases. 
IMHO the default value should primarily avoid the worst negative impacts for 
most or all cases rather than bring great results for only some use cases.
I personally use 4KB with >450GB data on a 128GB (12GB JVM heap) machine and 
the situation improved A LOT. We also have tables with >10M partitions and I 
haven't seen any problems so far.

If someone has a better proposal and maybe an explanation, why not.
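
For readers who want to try the same thing: the chunk length is a per-table compression option. Below is a minimal sketch (not from this ticket) of changing it with the DataStax Java driver 3.x; the keyspace and table names are hypothetical. Note that the new chunk length only applies to SSTables written afterwards; existing SSTables keep their old chunk size until they are rewritten. Read-ahead itself is an OS/block-device setting (e.g. blockdev --setra on Linux) and is tuned separately from Cassandra.

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Illustrative sketch only (hypothetical keyspace/table, DataStax Java driver 3.x assumed).
public class LowerChunkLength
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect())
        {
            // Lower the compression chunk length for one table. Newly flushed or compacted
            // SSTables will use 4kb chunks; existing SSTables keep their old chunk size
            // until they are rewritten (e.g. by compaction or upgradesstables).
            session.execute("ALTER TABLE my_ks.my_table WITH compression = " +
                            "{'class': 'LZ4Compressor', 'chunk_length_in_kb': 4}");
        }
    }
}
{code}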

> Lower default chunk_length_in_kb from 64kb to 4kb
> -
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>
> Having too low a chunk size may result in some wasted disk space. Too high a 
> chunk size may lead to massive overreads and can have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IOs of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with 
> read ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists mostly of small rows or small result sets, the read 
> overhead with a 64kb chunk size is insanely high. This applies, for example, 
> to (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insight into what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 4kb

2017-02-20 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13241:
-

 Summary: Lower default chunk_length_in_kb from 64kb to 4kb
 Key: CASSANDRA-13241
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Benjamin Roth


Having too low a chunk size may result in some wasted disk space. Too high a 
chunk size may lead to massive overreads and can have a critical impact on 
overall system performance.

In my case, the default chunk size led to peak read IOs of up to 1GB/s and avg 
reads of 200MB/s. After lowering the chunk size (of course aligned with read 
ahead), the avg read IO went below 20 MB/s, more like 10-15MB/s.

The risk of (physical) overreads increases with a lower (page cache size) / 
(total data size) ratio.

High chunk sizes are mostly appropriate for bigger payloads per request, but if 
the model consists mostly of small rows or small result sets, the read overhead 
with a 64kb chunk size is insanely high. This applies, for example, to (small) 
skinny rows.

Please also see here:
https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY

To give you some insight into what a difference it can make (460GB data, 128GB RAM):
- Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
- Disk throughput: https://cl.ly/2a0Z250S1M3c
- This shows that the request distribution remained the same, so no "dynamic 
snitch magic": https://cl.ly/3E0t1T1z2c0J
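
To make the overread argument a bit more concrete, here is a back-of-the-envelope sketch. The numbers are illustrative assumptions only (a ~200 byte row, cold page cache), not measurements from this ticket; the point is simply that every small read decompresses at least one whole chunk, and read-ahead multiplies what actually hits the disk.

{code:java}
// Back-of-the-envelope sketch of physical read amplification for small random reads.
// All numbers are illustrative assumptions, not measurements from this ticket.
public class OverreadEstimate
{
    static long bytesReadPerRequest(long rowBytes, long chunkBytes, long readAheadBytes)
    {
        // At least one full compression chunk must be read and decompressed,
        // plus whatever the kernel read-ahead drags in on a cold page cache.
        long chunks = (rowBytes + chunkBytes - 1) / chunkBytes;
        return chunks * chunkBytes + readAheadBytes;
    }

    public static void main(String[] args)
    {
        long row = 200;  // small skinny row, ~200 bytes
        long kb = 1024;
        long big = bytesReadPerRequest(row, 64 * kb, 128 * kb);
        long small = bytesReadPerRequest(row, 4 * kb, 0);
        System.out.printf("64kb chunks, 128kb read-ahead: %d bytes (~%dx overread)%n", big, big / row);
        System.out.printf(" 4kb chunks, no read-ahead   : %d bytes (~%dx overread)%n", small, small / row);
    }
}
{code}

This is only a model, but it illustrates why the drop in read IO described above can be so large once chunk size and read-ahead are both lowered.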



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-01-12 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821204#comment-15821204
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

Hi Victor,

1. Performance:
Performance can be better with an MV than with batches, but this depends on the 
read performance of the base table vs. the overhead for batches, which in turn 
depends on the batch size and the batchlog performance. An MV always does a 
read before write, so MV performance depends largely on that read. The final 
write operation of the MV update is fast, as it works like a regular (local) 
write.

2. Partition Keys and remote MV updates
You are of course right that this may be a common use case. You have to use it 
carefully. Maybe the situation has already improved through some bugfixes; the 
last time I tried was some months ago. To be fair, I have to mention that back 
then there was a bug with a race condition that could deadlock the whole 
mutation stage. With "remote MVs" we ran into this situation very frequently 
during bootstraps (for example). This has to do with MV locks and probably the 
much longer lock time when the MV update is remote, leading to more lock 
contention. With remote MV updates, the current write request also depends on 
the performance of remote nodes. This can lead to write timeouts much faster, 
as long as the (remote) MV update is part of the write request and not 
deferred. So again: maybe this situation has improved meanwhile, but I 
personally didn't require it, so I was able to use normal tables to "twist" the 
PK. We currently use MVs only to add a field to the primary key for sorting.
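
To illustrate the two cases discussed above with a concrete (made-up) schema: a view that keeps the base table's partition key and only adds a clustering column for sorting keeps the MV update local, while a view whose partition key differs from the base table's generally places base and view rows on different nodes, so every write triggers a remote MV update. This is a sketch only (hypothetical keyspace/table names, DataStax Java driver 3.x assumed), not schema from this ticket.

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Sketch only: hypothetical schema illustrating "same partition key" vs. "different partition key" views.
public class MvPartitionKeyExample
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect())
        {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
                            "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.scores (" +
                            "user_id int, game text, score int, PRIMARY KEY (user_id, game))");

            // Same partition key (user_id), score only added as a clustering column for sorting:
            // base and view replicas are co-located, so the MV update stays node-local.
            session.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS demo.scores_by_score AS " +
                            "SELECT user_id, game, score FROM demo.scores " +
                            "WHERE user_id IS NOT NULL AND game IS NOT NULL AND score IS NOT NULL " +
                            "PRIMARY KEY (user_id, score, game)");

            // Different partition key (game): base and view rows usually live on different
            // nodes, so every base-table write requires an inter-node MV update.
            session.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS demo.scores_by_game AS " +
                            "SELECT user_id, game, score FROM demo.scores " +
                            "WHERE game IS NOT NULL AND user_id IS NOT NULL " +
                            "PRIMARY KEY (game, user_id)");
        }
    }
}
{code}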

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side moved to the repaired set. The next repair run will stream 
> the same data back again, causing rows to bounce on and on between nodes on 
> each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-01-11 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818955#comment-15818955
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

Hi Victor,

We use MVs in production with billions of records and without known data loss.
"Painful + slow" refers to repairs and range movements (e.g. bootstrap +
decommission). Also (as mentioned in this ticket) incremental repairs don't
work, so full repairs create some overhead. Up to 3.10 there are bugs
leading to write timeouts, even to NPEs and completely blocked mutation
stages. This could even bring your cluster down. In 3.10 some issues have
been resolved - actually we use a patched trunk version which is 1-2 months
old.

Depending on your model, MVs can help a lot from a developer perspective.
Some cases are very resource intensive to manage without MVs, requiring
distributed locks and/or CAS.
For append-only workloads, it may be simpler NOT to use MVs at the moment.
Such workloads aren't very complex anyway, and MVs won't help that much
compared to the problems that may arise with them.

Painful scenarios: There is no recipe for that. You may or may not
encounter performance issues, depending on your model and your workload.
I'd recommend not using MVs that have a different partition key on the MV
than on the base table, as this requires inter-node communication for EVERY
write operation. So you can easily kill your cluster with bulk operations
(like in streaming).

At the moment our cluster runs stably, but it took months to find all the
bottlenecks and race conditions, recover from failures, and so on. So my
recommendation: you can get it to work, but you need time, and you should not
start with critical data, at least if it is not backed by another stable
storage. And you should use 3.10 when it is finally released, or build your
own version from trunk. I would not recommend using < 3.10 for MVs.

Btw.: Our own patched version does some dirty tricks that may lead to
inconsistencies in some situations, but we prefer some possible
inconsistencies (which we can deal with) over performance bottlenecks. I created
several tickets to improve MV performance in some streaming situations, but
it will take some time to really improve that situation.

Does this answer your question?






-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side moved to the repaired set. The next repair run will stream 
> the same data back again, causing rows to bounce on and on between nodes on 
> each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2017-01-11 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818766#comment-15818766
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

It depends ;) There are known issues, mostly related to repair and
streaming. MVs basically work and do what you expect of them. But
maintenance jobs may be slow and/or painful. So the good old saying is
true: you can use them if you understand them and know what you are doing.
But don't expect them to be plug and play.




> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender side moved to the repaired set. The next repair run will stream 
> the same data back again, causing rows to bounce on and on between nodes on 
> each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-13073) Optimize repair behaviour with MVs

2016-12-22 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13073:
-

 Summary: Optimize repair behaviour with MVs
 Key: CASSANDRA-13073
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13073
 Project: Cassandra
  Issue Type: Bug
Reporter: Benjamin Roth


I am referring to a discussion on the dev list about the MV streaming issues 
discussed in 12888.

It turned out that under some circumstances, repairing MVs can lead to 
inconsistencies. To remove those inconsistencies, it is necessary to repair the 
base table first and then the MV again. These inconsistencies can be created 
either by read repair or by CF/keyspace repair.

Proposition:
- Exclude MVs from keyspace repairs
- Disable read repairs on MVs or transform them to a read repair of the base 
table (maybe complicated but possible)

Explanation:

- CF base has PK a and field b
- MV has PK a, b
- 2 nodes n1 + n2, no hints

- Initial state is a=1,b=1 at time t=0
- Node n2 goes down
- Mutation a=1, b=2 at time t=1
- Node n2 comes up and node n1 goes down
- Mutation a=1, b=3 at time t=2
- Node n1.mv contains: a1=1, b=3 + tombstone for a1=1,b=1
- Node n2.mv contains: a1=1, b=2 + tombstone for a1=1,b=1

When doing a repair on mv _before_ repairing base, mv would look like:
- Node n1.mv contains: a1=1, b=3 + tombstone for a1=1,b=1 + a1=1, b=2
- Node n2.mv contains: a1=1, b=2 + tombstone for a1=1,b=1 + a1=1, b=3

Repairing _only_ the base table would create the correct result:
- Node n1.mv contains: a1=1, b=3 + tombstone for a1=1,b=1 + tombstone for 
a1=1,b=2
- Node n2.mv contains: a1=1, b=3 + tombstone for a1=1,b=1 (TS for a1=2,b=2 
should not have been created as b=3 was there, which shadows b=2 and should not 
reach the MV at all)

All this does not apply if CASSANDRA-13066 will be implemented and enabled




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-13066) Fast repair with materialized views

2016-12-21 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13066:
-

 Summary: Fast repair with materialized views
 Key: CASSANDRA-13066
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13066
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benjamin Roth


I propose adding a configuration option so that streams for tables with MVs do 
not have to go through the regular write path.
This may be either a global option or, better, a CF option.

Background:
A repair of a CF with an MV that is much out of sync creates many streams. 
These streams all go through the regular write path to assert local consistency 
of the MV. This causes a read before write for every single mutation, which 
puts a lot of pressure on the node - much more than simply streaming the 
SSTable down.
In some cases this can be avoided. Instead of only repairing the base table, 
all base + MV tables would have to be repaired. But this can break eventual 
consistency between the base table and the MV. The proposed behaviour is always 
safe when having append-only MVs. It also works when using CL_QUORUM writes, 
but it cannot be absolutely guaranteed that a quorum write is applied 
atomically, so this can also lead to inconsistencies if a quorum write is 
started but one node dies in the middle of the request.

So, this proposal can help a lot in some situations but can also break 
consistency in others. That's why it should be left up to the operator whether 
that behaviour is appropriate for individual use cases.

This issue came up here:
https://issues.apache.org/jira/browse/CASSANDRA-12888?focusedCommentId=15736599=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736599
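
A rough sketch of the receive-side decision such an option would influence is below. The names and the opt-in flag are purely hypothetical; this is not existing Cassandra code or configuration, just an illustration of where the proposal would plug in.

{code:java}
// Sketch only -- models the proposal, not actual Cassandra internals.
// "mvDirectStreamOptIn" stands for the hypothetical per-CF option suggested above.
final class StreamApplyDecision
{
    static boolean applyViaWritePath(boolean tableHasViews, boolean cdcEnabled, boolean mvDirectStreamOptIn)
    {
        if (cdcEnabled)
            return true;                  // CDC needs mutations to pass through the commit log
        if (tableHasViews)
            return !mvDirectStreamOptIn;  // proposal: operator may accept direct SSTable streaming
        return false;                     // plain tables: streamed SSTables are just added
    }
}
{code}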



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2016-12-21 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766478#comment-15766478
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

I marked it as critical as it severely affects cluster maintainability when 
using MVs. Maybe it's also worth considering as a bug.

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Priority: Critical
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write paths. This causes read-before-writes for every 
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2016-12-21 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13065:
--
Priority: Critical  (was: Major)

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Priority: Critical
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write paths. This causes read-before-writes for every 
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2016-12-21 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766474#comment-15766474
 ] 

Benjamin Roth commented on CASSANDRA-13065:
---

Required for decommissioning

> Consistent range movements to not require MV updates to go through write 
> paths 
> ---
>
> Key: CASSANDRA-13065
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>
> Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
> through the regular write paths. This causes read-before-writes for every 
> mutation, and during bootstrap it causes them to be sent to the batchlog.
> This makes it virtually impossible to boot a new node in an acceptable amount 
> of time.
> Using the regular streaming behaviour for consistent range movements works 
> much better in this case and does not break the MV local consistency contract.
> Already tested on own cluster.
> Bootstrap case is super easy to handle, decommission case requires 
> CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-13065) Consistent range movements to not require MV updates to go through write paths

2016-12-21 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13065:
-

 Summary: Consistent range movements to not require MV updates to 
go through write paths 
 Key: CASSANDRA-13065
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13065
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benjamin Roth


Booting or decommissioning nodes with MVs is unbearably slow as all streams go 
through the regular write paths. This causes read-before-writes for every 
mutation, and during bootstrap it causes them to be sent to the batchlog.
This makes it virtually impossible to boot a new node in an acceptable amount of 
time.
Using the regular streaming behaviour for consistent range movements works much 
better in this case and does not break the MV local consistency contract.

Already tested on own cluster.

Bootstrap case is super easy to handle, decommission case requires 
CASSANDRA-13064



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream

2016-12-21 Thread Benjamin Roth (JIRA)
Benjamin Roth created CASSANDRA-13064:
-

 Summary: Add stream type or purpose to stream plan / stream
 Key: CASSANDRA-13064
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13064
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benjamin Roth


It would be very good to know the type or purpose of a certain stream on the 
receiver side. It should be both available in a stream request and a stream 
task.

Why?
It would be helpful to distinguish the purpose to allow different handling of 
streams and requests. Examples:
- In a stream request, a global flush is done. This is not necessary for all 
types of streams. A repair stream(-plan) does not require a flush as this has 
been done shortly before in validation compaction, and only the sstables that 
have been validated have to be streamed.
- In StreamReceiveTask, streams for MVs go through the regular write path. This 
is painfully slow, especially on bootstrap and decommission. For both bootstrap 
and decommission this is not necessary; sstables can be directly streamed down 
in this case. Handling bootstrap is no problem as it relies on a local state, 
but during decommission the decom-state is bound to the sender and not the 
receiver, so the receiver has to know that it is safe to stream that sstable 
directly, not through the write path. That's why we have to know the purpose of 
the stream.

I'd love to implement this on my own but I am not sure how not to break the 
streaming protocol for backwards compat or if it is ok to do so.

Furthermore I'd love to get some feedback on that idea and some proposals what 
stream types to distinguish. I could imagine:
- bootstrap
- decommission
- repair
- replace node
- remove node
- range relocation

Comments like this support my idea; knowing the purpose could avoid this:
{quote}
// TODO each call to transferRanges re-flushes, this is 
potentially a lot of waste
streamPlan.transferRanges(newEndpoint, preferred, keyspaceName, 
ranges);
{quote}

An alternative to passing the purpose of the stream would be to pass flags like:
- requiresFlush
- requiresWritePathForMaterializedView
...

I guess passing the purpose will make the streaming protocol more robust for 
future changes and leave decisions up to the receiver.
But an additional "requiresFlush" would also avoid putting too much logic into 
the streaming code. The streaming code should not care about purposes; the 
caller or receiver should. So the decision whether a stream requires a flush 
before streaming should be up to the stream requester and the stream request 
receiver, depending on the purpose of the stream.

I'm excited about your feedback :)
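
To make the two variants a bit more tangible, here is a rough sketch of what a stream purpose with derived flags could look like on the receiver side. These types and methods are illustrative only and do not exist in Cassandra; the flag semantics just mirror the examples given above.

{code:java}
// Sketch only: one possible shape for the proposal, not existing Cassandra code.
enum StreamPurpose
{
    BOOTSTRAP, DECOMMISSION, REPAIR, REPLACE_NODE, REMOVE_NODE, RANGE_RELOCATION;

    // Repair streams were already flushed for validation compaction; other purposes may still need a flush.
    boolean requiresFlushBeforeStream()
    {
        return this != REPAIR;
    }

    // Per the examples above, bootstrap and decommission could stream MV SSTables
    // directly instead of going through the regular write path.
    boolean requiresWritePathForMaterializedView()
    {
        return this != BOOTSTRAP && this != DECOMMISSION;
    }
}
{code}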



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-13064) Add stream type or purpose to stream plan / stream

2016-12-21 Thread Benjamin Roth (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Roth updated CASSANDRA-13064:
--
Description: 
It would be very good to know the type or purpose of a certain stream on the 
receiver side. It should be both available in a stream request and a stream 
task.

Why?
It would be helpful to distinguish the purpose to allow different handling of 
streams and requests. Examples:
- In stream request a global flush is done. This is not necessary for all types 
of streams. A repair stream(-plan) does not require a flush as this has been 
done shortly before in validation compaction and only the sstables that have 
been validated also have to be streamed.
- In StreamReceiveTask streams for MVs go through the regular write path this 
is painfully slow especially on bootstrap and decomission. Both for bootstrap 
and decommission this is not necessary. Sstables can be directly streamed down 
in this case. Handling bootstrap is no problem as it relies on a local state 
but during decommission, the decom-state is bound to the sender and not the 
receiver, so the receiver has to know that it is safe to stream that sstable 
directly, not through the write-path. Thats why we have to know the purpose of 
the stream.

I'd love to implement this on my own but I am not sure how not to break the 
streaming protocol for backwards compat or if it is ok to do so.

Furthermore I'd love to get some feedback on that idea and some proposals what 
stream types to distinguish. I could imagine:
- bootstrap
- decommission
- repair
- replace node
- remove node
- range relocation

Comments like this support my idea, knowing the purpose could avoid this.
{quote}
// TODO each call to transferRanges re-flushes, this is 
potentially a lot of waste
streamPlan.transferRanges(newEndpoint, preferred, keyspaceName, 
ranges);
{quote}

And alternative to passing the purpose of the stream was to pass flags like:
- requiresFlush
- requiresWritePathForMaterializedView
...

I guess passing the purpose will make the streaming protocol more robust for 
future changes and leaves decisions up to the receiver.
But an additional "requiresFlush" would also avoid putting too much logic into 
the streaming code. The streaming code should not care about purposes, the 
caller or receiver should. So the decision if a stream requires as flush before 
stream should be up to the stream requester and the stream request receiver 
depending on the purpose of the stream.

I'm excited about your feedback :)

  was:
It would be very good to know the type or purpose of a certain stream on the 
receiver side. It should be both available in a stream request and a stream 
task.

Why?
It would be helpful to distinguish the purpose to allow different handling of 
streams and requests. Examples:
- In stream request a global flush is done. This is not necessary for all types 
of streams. A repair stream(-plan) does not require a flush as this has been 
done shortly before in validation compaction and only the sstables that have 
been validated also have to be streamed.
- In StreamReceiveTask streams for MVs go through the regular write path this 
is painfully slow especially on bootstrap and decomission. Both for bootstrap 
and decommission this is not necessary. Sstables can be directly streamed down 
in this case. Handling bootstrap is no problem as it relies on a local state 
but during decommission, the decom-state is bound to the sender and not the 
receiver, so the receiver has to know that it is safe to stream that sstable 
directly, not through the write-path. Thats why we have to know the purpose of 
the stream.

I'd love to implement this on my own but I am not sure how not to break the 
streaming protocol for backwards compat or if it is ok to do so.

Furthermore I'd love to get some feedback on that idea and some proposals what 
stream types to distinguish. I could imagine:
- bootstrap
- decommission
- repair
- replace node
- remove node
- range relocation

Comments like this support my idea, knowing the purpose could avoid this.
{quote}
// TODO each call to transferRanges re-flushes, this is 
potentially a lot of waste
streamPlan.transferRanges(newEndpoint, preferred, keyspaceName, 
ranges);
{quote}

And alternative to passing the purpose of the stream was to pass flags like:
- requiresFlush
- requiresWritePathForMaterializedView
...

I guess passing the purpose will make the streaming protocol more robust for 
future changes and leaves decisions up to the receiver.
But an additional "requiresFlush" would also avoid putting too much logic into 
the streaming code. The streaming code should not care about purposes, the 
caller or receiver should. So the decision if a stream requires as flush before 
stream should be up to the stream requester and the stream request receiver 

[jira] [Comment Edited] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming

2016-12-15 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750958#comment-15750958
 ] 

Benjamin Roth edited comment on CASSANDRA-12905 at 12/15/16 10:00 AM:
--

+1 for the dtest!

Thanks for explaining that hint-thing.

For my understanding of hint retransmission:
If we have a hint file that is being processed and a WTE occurs in the middle, 
will the whole file be retransmitted or can it be resumed at the last 
successful position? 

I guess this is not the case from my personal observations. I had situations 
with > 1GB hint queues per sender node which were not going to shrink due to 
WTEs. It seemed like the same hints have been retransmitted over and over again 
from scratch instead of trying to resume. What helped in this situation was to 
pause hint delivery on all nodes but 1-2. I'm pretty sure the problem was WTEs 
due to lock contentions. To be honest, I did not try lowering 
hinted_handoff_throttle_in_kb or at least I don't remember.


was (Author: brstgt):
+1 for the dtest!

For my understanding of hint retransmission:
If we have a hint file that is being processed and a WTE occurs in the middle, 
will the whole file be retransmitted or can it be resumed at the last 
successful position? 

I guess this is not the case from my personal observations. I had situations 
with > 1GB hint queues per sender node which were not going to shrink due to 
WTEs. It seemed like the same hints have been retransmitted over and over again 
from scratch instead of trying to resume. What helped in this situation was to 
pause hint delivery on all nodes but 1-2. I'm pretty sure the problem was WTEs 
due to lock contentions. To be honest, I did not try lowering 
hinted_handoff_throttle_in_kb or at least I don't remember.

> Retry acquire MV lock on failure instead of throwing WTE on streaming
> -
>
> Key: CASSANDRA-12905
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12905
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: centos 6.7 x86_64
>Reporter: Nir Zilka
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.10
>
>
> Hello,
> I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, 
> private VLAN),
> first it was 2.2.5.1 and repair worked flawlessly,
> second upgrade was to 3.0.9 (with upgradesstables) and also repair worked 
> well,
> then i upgraded 2 weeks ago to 3.9 - and the repair problems started.
> there are several errors types from the system.log (different nodes) :
> - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx
> - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation 
> timed out - received only 0 responses
> - Remote peer xxx.xxx.xxx.xxx failed stream session
> - Session completed with the following error
> org.apache.cassandra.streaming.StreamException: Stream failed
> 
> i use 3.9 default configuration with the cluster settings adjustments (3 
> seeds, GossipingPropertyFileSnitch).
> streaming_socket_timeout_in_ms is the default (8640).
> i'm afraid from consistency problems while i'm not performing repair.
> Any ideas?
> Thanks,
> Nir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming

2016-12-15 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750958#comment-15750958
 ] 

Benjamin Roth commented on CASSANDRA-12905:
---

+1 for the dtest!

For my understanding of hint retransmission:
If we have a hint file that is being processed and a WTE occurs in the middle, 
will the whole file be retransmitted or can it be resumed at the last 
successful position? 

I guess this is not the case from my personal observations. I had situations 
with > 1GB hint queues per sender node which were not going to shrink due to 
WTEs. It seemed like the same hints have been retransmitted over and over again 
from scratch instead of trying to resume. What helped in this situation was to 
pause hint delivery on all nodes but 1-2. I'm pretty sure the problem was WTEs 
due to lock contentions. To be honest, I did not try lowering 
hinted_handoff_throttle_in_kb or at least I don't remember.

> Retry acquire MV lock on failure instead of throwing WTE on streaming
> -
>
> Key: CASSANDRA-12905
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12905
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: centos 6.7 x86_64
>Reporter: Nir Zilka
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.10
>
>
> Hello,
> I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, 
> private VLAN),
> first it was 2.2.5.1 and repair worked flawlessly,
> second upgrade was to 3.0.9 (with upgradesstables) and also repair worked 
> well,
> then i upgraded 2 weeks ago to 3.9 - and the repair problems started.
> there are several errors types from the system.log (different nodes) :
> - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx
> - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation 
> timed out - received only 0 responses
> - Remote peer xxx.xxx.xxx.xxx failed stream session
> - Session completed with the following error
> org.apache.cassandra.streaming.StreamException: Stream failed
> 
> i use 3.9 default configuration with the cluster settings adjustments (3 
> seeds, GossipingPropertyFileSnitch).
> streaming_socket_timeout_in_ms is the default (8640).
> i'm afraid from consistency problems while i'm not performing repair.
> Any ideas?
> Thanks,
> Nir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12991) Inter-node race condition in validation compaction

2016-12-15 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750908#comment-15750908
 ] 

Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:43 AM:
-

I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}
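
For what it's worth, figures of exactly this shape fall out of a very simple model: assume writes are spread evenly over all ranges and that the race window per range is roughly the mutation latency plus the validation latency on both endpoints. This is a sketch under those assumptions, not necessarily the model used in the linked gist.

{code:java}
// Sketch of a simple estimate; assumptions as described above, not necessarily the gist's model.
public class RepairRaceEstimate
{
    static void scenario(int nodes, int tokensPerNode, double mutationLatencySec,
                         double validationLatencySec, double reqPerSec)
    {
        int totalRanges = nodes * tokensPerNode;
        // Window in which a mutation can land on one replica but miss the other replica's validation.
        double raceWindowSec = mutationLatencySec + 2 * validationLatencySec;
        double unnecessaryRangeRepairs = reqPerSec * raceWindowSec;
        double likelinessPerRange = unnecessaryRangeRepairs / totalRanges;
        System.out.printf("%d ranges: %.2f%% per range, %.2f unnecessary range repairs%n",
                          totalRanges, likelinessPerRange * 100, unnecessaryRangeRepairs);
    }

    public static void main(String[] args)
    {
        scenario(3, 256, 0.001, 0.001, 1000);  // ~0.39% / 3
        scenario(3, 256, 0.010, 0.001, 1000);  // ~1.56% / 12
        scenario(8, 256, 0.010, 0.001, 5000);  // ~2.93% / 60
        scenario(8, 256, 0.020, 0.001, 5000);  // ~5.37% / 110
    }
}
{code}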

You may ask why I entered latencies like 10ms or 20ms - this seems quite high. 
It is indeed quite high for regular tables and a cluster that is not 
overloaded. Under these conditions, the latency is dominated by your network 
latency, so 1ms seems quite fair to me.
As soon as you use MVs and your cluster tends to be overloaded, higher latencies 
are not unrealistic.
You have to take into account that an MV operation does a read before write and 
the latency can vary a lot. For MVs the latency is no longer dominated (only) 
by network latency but by MV lock acquisition and the read before write. 
Both factors can introduce MUCH higher latencies, depending on concurrent 
operations on the MV, number of SSTables, compaction strategy - just everything 
that affects read performance.
If your cluster is overloaded, these effects have an even higher impact.

I observed MANY situations on our production system where writes timed out 
during streaming because of lock contention and/or RBW impacts. These 
situations mainly pop up during repair sessions when streams cause bulk 
mutation applies (see the StreamReceiveTask path for MVs). The impact is even 
higher due to CASSANDRA-12888. Parallel repairs, as e.g. reaper does them, make 
the situation even more unpredictable and increase "drifts" between nodes, like 
Node A being overloaded but Node B not, because Node A receives a stream from a 
different repair but Node B does not.

This is a vicious circle driven by several factors: 
- Stream puts pressure on nodes - especially larg(er) partitions
- hints tend to queue up
- hint delivery puts more pressure
- retransmission of failed hint delivery puts even more pressure
- latencies go up
- stream validations drift
- more (unnecessary) streams
- goto 0

This calculation example is just hypothetical. This *may* happen as calculated, 
but it totally depends on the model, cluster dimensions, cluster load, write 
activity, distribution of writes and repair execution. I don't claim that 
fixing this issue will remove all MV performance problems, but it may help 
to remove one impediment in the mentioned vicious circle.

My proposal is NOT to control flushes. This is far too complicated and won't 
help. A flush, whenever it may happen and whatever range it flushes, may or may 
not contain a mutation that _should_ be there. What helps is to cut off all 
data retrospectively at a synchronized and fixed timestamp when executing the 
validation. You can define a grace period (GP). When you start validation at VS 
on the repair coordinator, then you expect all mutations that were created 
before VS - GP to have arrived no later than VS. That can be done at SSTable 
scanner level by filtering out all events (cells, tombstones) after VS - GP 
during validation compaction - something like the opposite of purging 
tombstones after GCGS.
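
Conceptually, the cutoff described in the last paragraph could be as simple as a timestamp predicate applied while scanning SSTables for the validation. The names below are assumed for illustration; this is not an existing Cassandra hook.

{code:java}
// Sketch only: the "opposite of tombstone purging" idea as a timestamp predicate.
final class ValidationCutoff
{
    // Include a cell or tombstone in the validation only if it is older than VS - GP.
    static boolean includeInValidation(long eventTimestampMicros,
                                       long validationStartMicros,
                                       long gracePeriodMicros)
    {
        return eventTimestampMicros <= validationStartMicros - gracePeriodMicros;
    }
}
{code}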


was (Author: brstgt):
I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation 

[jira] [Comment Edited] (CASSANDRA-12991) Inter-node race condition in validation compaction

2016-12-15 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750908#comment-15750908
 ] 

Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:34 AM:
-

I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}

You may ask why I entered latencies like 10ms or 20ms - this seems quite high. 
It is indeed quite high for regular tables and a cluster that is not 
overloaded. Under these conditions, the latency is dominated by your network 
latency, so 1ms seems quite fair to me.
As soon as you use MVs and your cluster tends to overload, higher latencies are 
not unrealistic.
You have to take into account that an MV operation does read before write and 
the latency may vary very much. For MVs the latency is not (only) any more 
dominated by network latency but by MV lock aquisition and read before write. 
Both factors can introduce MUCH higher latencies, depending on concurrent 
operations on MV, number of SSTables, compaction strategy, just everything that 
affects read performance.
If your cluster is overloaded, these effects have an even higher impact.

I observed MANY situations on our production system where writes timed out 
during streaming because of lock contention and or RBW impacts. These 
situations mainly pop up during repair sessions when streams cause bulk 
mutation applies (see StreamReceiverTask path for MVs). Impact is even higher 
due to CASSANDRA-12888. Parallel repairs like e.g. reaper does, makes the 
situation even more unpredictable and increases "drifts" of nodes, like Node A 
is overloaded but Node B not because Node A receives a stream from a different 
repair but Node B does not.

This is a vicious circle driven several factors: 
- Stream puts pressure on nodes - especially larg(er) partitions
- hints tend to queue up
- hint delivery puts more pressure
- retransmission of failed hint delivery puts even more pressure
- latencies go up
- stream validations drift
- more (unnecessary) streams
- goto 0

This calculation example is just hypothetic. This *may* happen as calculated 
but it totally depends on the model, cluster dimensions, cluster load, write 
activity, distribution of writes and repair execution. I don't claim that 
fixing this issue will remove all MV performance problems but it may be helps 
to remove one impediment in the mentioned vicious circle.

My proposal is NOT to control flushes. This is far too complicated and wont 
help. A flush, whenever it may happen and whatever range it flushes may or may 
not contain a mutation that _should_ be there. The only thing that helps is to 
cut off all data retrospectively at a synchronized and fix timestamp when 
executing the validation. You can only define a grace period (GP). When you 
start validation at VS on the repair coordinator, then you expect all mutations 
to arrive no later than VS that were created before VS - GP. That can IMHO only 
be done at SSTable scanner level by filtering all events (cells, tombstones) 
after VS - GP during validation compaction. Something like the opposite of 
purging tombstones after GCGS.


was (Author: brstgt):
I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for 

[jira] [Comment Edited] (CASSANDRA-12991) Inter-node race condition in validation compaction

2016-12-15 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750908#comment-15750908
 ] 

Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:34 AM:
-

I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}

You may ask why I entered latencies like 10ms or 20ms - this seems quite high. 
It is indeed quite high for regular tables and a cluster that is not 
overloaded. Under these conditions, the latency is dominated by your network 
latency, so 1ms seems quite fair to me.
As soon as you use MVs and your cluster tends to overload, higher latencies are 
not unrealistic.
You have to take into account that an MV operation does read before write and 
the latency may vary very much. For MVs the latency is not (only) any more 
dominated by network latency but by MV lock aquisition and read before write. 
Both factors can introduce MUCH higher latencies, depending on concurrent 
operations on MV, number of SSTables, compaction strategy, just everything that 
affects read performance.
If your cluster is overloaded, these effects have an even higher impact.

I observed MANY situations on our production system where writes timed out 
during streaming because of lock contention and or RBW impacts. These 
situations mainly pop up during repair sessions when streams cause bulk 
mutation applies (see StreamReceiverTask path for MVs). Impact is even higher 
due to CASSANDRA-12888. Parallel repairs like e.g. reaper does, makes the 
situation even more unpredictable and increases "drifts" of nodes, like Node A 
is overloaded but Node B not because Node A receives a stream from a different 
repair but Node B does not.

This is a vicious circle driven several factors: 
- Stream puts pressure on nodes - especially larg(er) partitions
- hints tend to queue up
- hint delivery puts more pressure
- retransmission of failed hint delivery puts even more pressure
- latencies go up
- stream validations drift
- more (unnecessary) streams
- goto 0

This calculation example is just hypothetic. This *may* happen as calculated 
but it totally depends on the model, cluster dimensions, cluster load, write 
activity, distribution of writes and repair execution. I don't claim that 
fixing this issue will remove all MV performance problems but it may be helps 
to remove one impediment in the mentioned vicious circle.

My proposal is NOT to control flushes. This is far too complicated and wont 
help. A flush, whenever it may happen and whatever range it flushes may or may 
not contain a mutation that _should_ be there. The only thing that helps is to 
cut off all data retrospectively at a synchronized and fix timestamp when 
executing the validation. You can only define a grace period (GP). When you 
start validation at VS on the repair coordinator, then you expect all mutations 
to arrive no later than VS that were created before VS - GP. That can IMHO only 
be done at SSTable scanner level by filtering all events (cells, tombstones) 
after VS - GP during validation compaction. Something like the opposite of 
purging tombstones after GCGS.


was (Author: brstgt):
I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}

You may ask why I entered 

[jira] [Comment Edited] (CASSANDRA-12991) Inter-node race condition in validation compaction

2016-12-15 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750908#comment-15750908
 ] 

Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:33 AM:
-

I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}

You may ask why I entered latencies like 10ms or 20ms - this seems quite high. 
It is indeed quite high for regular tables and a cluster that is not 
overloaded. Under these conditions, the latency is dominated by your network 
latency, so 1ms seems quite fair to me.
As soon as you use MVs and your cluster tends to overload, higher latencies are 
not unrealistic.
You have to take into account that an MV operation does read before write and 
the latency may vary very much. For MVs the latency is not (only) any more 
dominated by network latency but by MV lock aquisition and read before write. 
Both factors can introduce MUCH higher latencies, depending on concurrent 
operations on MV, number of SSTables, compaction strategy, just everything that 
affects read performance.
If your cluster is overloaded, these effects have an even higher impact.

I observed MANY situations on our production system where writes timed out 
during streaming because of lock contention and or RBW impacts. These 
situations mainly pop up during repair sessions when streams cause bulk 
mutation applies (see StreamReceiverTask path for MVs). Impact is even higher 
due to CASSANDRA-12888. Parallel repairs like e.g. reaper does, makes the 
situation even more unpredictable and increases "drifts" of nodes, like Node A 
is overloaded but Node B not because Node A receives a stream from a different 
repair but Node B does not.

This is a vicious circle driven several factors: 
- Stream puts pressure on nodes - especially larg(er) partitions
- hints tend to queue up
- hint delivery puts more pressure
- retransmission of failed hint delivery puts even more pressure
- latencies go up
- stream validations drift
- more (unnecessary) streams
- goto 0

This calculation example is just hypothetic. This *may* happen as calculated 
but it totally depends on the model, cluster dimensions, cluster load, write 
activity, distribution of writes and repair execution. I don't claim that 
fixing this issue will remove all MV performance problems but it may be helps 
to remove one impediment in the mentioned vicious circle.

My proposal is NOT to control flushes. This is far too complicated and won't 
help. A flush, whenever it happens and whatever range it flushes, may or may 
not contain a mutation that _should_ be there. The only thing that helps is to 
cut off all data retrospectively at a synchronized, fixed timestamp when 
executing the validation. You can only define a grace period (GP). When you 
start validation at VS on the repair coordinator, then you expect all mutations 
created before VS - GP to have arrived no later than VS. That can IMHO only be 
done at the SSTable scanner level by filtering out all events (cells, 
tombstones) after VS - GP during validation compaction - something like the 
opposite of purging tombstones after GCGS.
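
To make the idea more concrete, a rough sketch of such a cutoff filter. It is purely illustrative - the Event type and the parameters are hypothetical stand-ins, not the actual SSTable scanner API:

{code:java}
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the proposed cutoff, not real scanner code. "Event" stands
// in for cells/tombstones; VS and GP are the validation start and grace period
// chosen by the repair coordinator.
final class ValidationCutoff
{
    interface Event { long writeTimeMicros(); }

    /** Keep only events written before VS - GP; everything newer is ignored when
     *  building the merkle tree - the opposite of purging tombstones after GCGS. */
    static List<Event> filterForValidation(List<Event> events,
                                           long validationStartMicros,
                                           long gracePeriodMicros)
    {
        long cutoff = validationStartMicros - gracePeriodMicros;
        return events.stream()
                     .filter(e -> e.writeTimeMicros() < cutoff)
                     .collect(Collectors.toList());
    }
}
{code}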


was (Author: brstgt):
I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness 

[jira] [Commented] (CASSANDRA-12991) Inter-node race condition in validation compaction

2016-12-15 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750908#comment-15750908
 ] 

Benjamin Roth commented on CASSANDRA-12991:
---

I created a little script to calculate some possible scenarios: 
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 
req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 
req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}

You may ask why I entered latencies like 10ms or 20ms - this seems quite high. 
It is indeed quite high for regular tables and a cluster that is not 
overloaded. Under those conditions the latency is dominated by your network 
latency, so 1ms seems quite fair to me.
As soon as you use MVs and your cluster tends to overload, higher latencies are 
not unrealistic.
You have to take into account that an MV operation does a read before write, so 
the latency may vary a lot. For MVs the latency is no longer dominated (only) 
by network latency but by MV lock acquisition and read before write. Both 
factors can introduce MUCH higher latencies, depending on concurrent operations 
on the MV, the number of SSTables, the compaction strategy - everything that 
affects read performance.
If your cluster is overloaded, these effects have an even higher impact.

I observed MANY situations on our production system where writes timed out 
during streaming because of lock contention and/or read-before-write (RBW) 
impacts. These situations mainly pop up during repair sessions when streams 
cause bulk mutation applies (see the StreamReceiverTask path for MVs). The 
impact is even higher due to CASSANDRA-12888. Parallel repairs, like Reaper 
runs them, make the situation even more unpredictable and increase the "drift" 
between nodes, e.g. Node A is overloaded but Node B is not, because Node A 
receives a stream from a different repair and Node B does not.

This is a vicious circle driven by several factors: 
- Streams put pressure on nodes - especially larg(er) partitions
- hints tend to queue up
- hint delivery puts more pressure
- retransmission of failed hint deliveries puts even more pressure
- latencies go up
- stream validations drift
- more (unnecessary) streams
- goto 0

This calculation example is just hypothetical. It *may* happen as calculated, 
but it totally depends on the model, cluster dimensions, cluster load, write 
activity, distribution of writes and repair execution. I don't claim that 
fixing this issue will remove all MV performance problems, but it may help to 
remove one impediment in the mentioned vicious circle.

My proposal is NOT to control flushes. This is far too complicated and won't 
help. A flush, whenever it happens and whatever range it flushes, may or may 
not contain a mutation that _should_ be there. The only thing that helps is to 
cut off all data retrospectively at a synchronized, fixed timestamp when 
executing the validation. You can only define a grace period (GP). When you 
start validation at VS on the repair coordinator, then you expect all mutations 
created before VS - GP to have arrived no later than VS. That can IMHO only be 
done at the SSTable scanner level by filtering out all events (cells, 
tombstones) after VS - GP during validation compaction - something like the 
opposite of purging tombstones after GCGS.

> Inter-node race condition in validation compaction
> --
>
> Key: CASSANDRA-12991
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benjamin Roth
>Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that, 
> due to in-flight mutations, the merkle trees differ even though the data is 
> consistent.
> Example:
> t = 1: 
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at 

[jira] [Commented] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming

2016-12-13 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747566#comment-15747566
 ] 

Benjamin Roth commented on CASSANDRA-12905:
---

I like your naming changes, they make sense. But by making hint delivery async, 
you made it droppable again. I guess this is not intentional? You also do not 
send a reply on an exception. I am not familiar with the request/response 
handling, but I guess an exception (like a WTE) will just drop the hint and let 
the hint sender wait for a reply indefinitely or until it times out?
The hint sender should be able to recover from an exception, e.g. by 
re-transmitting the hints, right? I am not sure if this is the case here.
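
To illustrate the kind of recovery I mean, a hedged sketch (all names are hypothetical stand-ins, not the actual hint delivery code): on a failed delivery the sender keeps the hint and re-transmits it later instead of treating it as delivered.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sender-side retransmission, not the real hint dispatch path.
final class RetransmittingHintSender
{
    interface Transport { void send(Object hint) throws Exception; } // may time out or report an error

    private final Queue<Object> pending = new ArrayDeque<>();
    private final Transport transport;

    RetransmittingHintSender(Transport transport) { this.transport = transport; }

    void enqueue(Object hint) { pending.add(hint); }

    /** One delivery pass; failed hints stay queued and are re-transmitted on the next pass. */
    void deliverOnce()
    {
        for (int i = pending.size(); i > 0; i--)
        {
            Object hint = pending.poll();
            try
            {
                transport.send(hint);  // success: hint is done
            }
            catch (Exception e)
            {
                pending.add(hint);     // failure: keep it and retry later instead of dropping it
            }
        }
    }
}
{code}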

> Retry acquire MV lock on failure instead of throwing WTE on streaming
> -
>
> Key: CASSANDRA-12905
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12905
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: centos 6.7 x86_64
>Reporter: Nir Zilka
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.10
>
>
> Hello,
> I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, 
> private VLAN),
> first it was 2.2.5.1 and repair worked flawlessly,
> second upgrade was to 3.0.9 (with upgradesstables) and also repair worked 
> well,
> then I upgraded 2 weeks ago to 3.9 - and the repair problems started.
> there are several error types in the system.log (different nodes):
> - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx
> - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation 
> timed out - received only 0 responses
> - Remote peer xxx.xxx.xxx.xxx failed stream session
> - Session completed with the following error
> org.apache.cassandra.streaming.StreamException: Stream failed
> 
> I use the 3.9 default configuration with the cluster settings adjustments (3 
> seeds, GossipingPropertyFileSnitch).
> streaming_socket_timeout_in_ms is the default (8640).
> I'm afraid of consistency problems while I'm not performing repairs.
> Any ideas?
> Thanks,
> Nir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC

2016-12-09 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737403#comment-15737403
 ] 

Benjamin Roth commented on CASSANDRA-12888:
---

Hi Paulo,

Thanks for the congrats!

About your proposal to skip the base table mutations:
I haven't analyzed it thoroughly (no time, you know) but my intuition says that 
there will be race conditions and possible inconsistencies if you pick the base 
table mutation out of the lock phase. I guess to assert base table <> view 
replica consistency you'd have to lock the whole CF while streaming a single 
SSTable, to ensure that the MV mutations are processed serially and no other 
base table mutations slip in from the mutation stage and mess with the 
consistency.
As far as I can see, the base table apply, the base read and the MV mutations 
MUST be serialized (actually that's why there's a lock). Otherwise you will 
have stale MV rows again. This is why I think this proposal won't work. Or did 
I miss the point?
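
To illustrate the invariant the lock protects, a purely hypothetical sketch (stand-in types, not the real write path): the base read, the view update computation and both writes form one critical section per base partition.

{code:java}
import java.util.concurrent.locks.Lock;

// Hypothetical stand-ins only; shows the serialization requirement, not Cassandra internals.
final class MvWritePathInvariant
{
    interface BaseStore { Object readCurrentRow(Object key); void apply(Object mutation); }
    interface ViewStore { void apply(Object viewUpdate); }

    static void applyBaseMutation(Lock partitionLock, BaseStore base, ViewStore view,
                                  Object key, Object mutation)
    {
        partitionLock.lock();
        try
        {
            Object before = base.readCurrentRow(key);                 // read-before-write
            Object viewUpdate = computeViewUpdate(before, mutation);  // derive view delta from old state
            base.apply(mutation);                                     // base and view applied
            view.apply(viewUpdate);                                   // under the same lock
        }
        finally
        {
            partitionLock.unlock();
        }
    }

    private static Object computeViewUpdate(Object before, Object mutation)
    {
        return mutation; // placeholder: the real logic diffs old vs. new row state
    }
}
{code}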

CDC:
This case should be quite simple. I think you don't need the write path at all 
and just have to write the incoming mutations to the commit log in addition to 
streaming the sstable. In the worst case (a server crash), commit log replay 
leads to redundant and unrepaired entries, but this should be a rare and 
recoverable situation.
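
A rough sketch of what I mean, with hypothetical stand-in names rather than the real streaming internals: archive the mutations via the commit log for CDC, but add the streamed sstable directly instead of replaying it through the write path.

{code:java}
// Hypothetical sketch; CommitLogAppender and SSTableAdder are stand-ins, not real APIs.
final class CdcStreamReceiveSketch
{
    interface CommitLogAppender { void append(Object mutation); }
    interface SSTableAdder { void add(Object sstable); }

    static void onStreamCompleted(Iterable<Object> mutations, Object sstable,
                                  boolean cdcEnabled,
                                  CommitLogAppender commitLog, SSTableAdder sstables)
    {
        if (cdcEnabled)
        {
            // archive-only pass: make the streamed mutations visible to the CDC process
            for (Object m : mutations)
                commitLog.append(m);
        }
        // add the sstable as-is instead of pushing every mutation through the write path
        sstables.add(sstable);
    }
}
{code}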

What do you think?

> Incremental repairs broken for MVs and CDC
> --
>
> Key: CASSANDRA-12888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Stefan Podkowinski
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.0.x, 3.x
>
>
> SSTables streamed during the repair process will first be written locally and 
> afterwards either simply added to the pool of existing sstables or, in case 
> of existing MVs or active CDC, replayed on mutation basis:
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we 
> must put all partitions through the same write path as normal mutations. This 
> also ensures any 2is are also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through 
> the CommitLog so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental 
> repairs, as we lose the {{repaired_at}} state in the process. Eventually the 
> streamed rows will end up in the unrepaired set, in contrast to the rows on 
> the sender site moved to the repaired set. The next repair run will stream 
> the same data back again, causing rows to bounce on and on between nodes on 
> each repair.
> See linked dtest on steps to reproduce. An example for reproducing this 
> manually using ccm can be found 
> [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming

2016-12-09 Thread Benjamin Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15735402#comment-15735402
 ] 

Benjamin Roth edited comment on CASSANDRA-12905 at 12/9/16 2:02 PM:


Hi Paulo,

Indeed, my time is currently VERY limited; my wife came back home from hospital 
with our newborn child. So currently I am not able to think about all that with 
the concentration it requires and deserves. So I suggest you apply the cosmetic 
changes on your own - after all, you have much more experience with the code 
base, so I'd leave these decisions up to you anyway. I personally would not 
know how to interpret the name "droppable", but if you say there is a pattern 
that is also used somewhere else, why not.
Given your feedback, I absolutely support your decision to rework and/or 
rethink this issue and address it later so as not to block the release.

My plan was to get back to all those issues next week when (hopefully) my CPU 
is a little bit more idle again :) Probably I will then ask for help / a second 
brain.

Thanks so far!


was (Author: brstgt):
Hi Pualo,

Indeed, my time is currently VERY limited; my wife came back home from hospital 
with our newborn child. So currently I am not able to think about all that with 
the concentration it requires and deserves. So I suggest you apply the cosmetic 
changes on your own - after all, you have much more experience with the code 
base, so I'd leave these decisions up to you anyway. I personally would not 
know how to interpret the name "droppable", but if you say there is a pattern 
that is also used somewhere else, why not.
Given your feedback, I absolutely support your decision to rework and/or 
rethink this issue and address it later so as not to block the release.

My plan was to get back to all those issues next week when (hopefully) my CPU 
is a little bit more idle again :) Probably I will then ask for help / a second 
brain.

Thanks so far!

> Retry acquire MV lock on failure instead of throwing WTE on streaming
> -
>
> Key: CASSANDRA-12905
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12905
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: centos 6.7 x86_64
>Reporter: Nir Zilka
>Assignee: Benjamin Roth
>Priority: Critical
> Fix For: 3.10
>
>
> Hello,
> I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, 
> private VLAN),
> first it was 2.2.5.1 and repair worked flawlessly,
> second upgrade was to 3.0.9 (with upgradesstables) and also repair worked 
> well,
> then I upgraded 2 weeks ago to 3.9 - and the repair problems started.
> there are several error types in the system.log (different nodes):
> - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx
> - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation 
> timed out - received only 0 responses
> - Remote peer xxx.xxx.xxx.xxx failed stream session
> - Session completed with the following error
> org.apache.cassandra.streaming.StreamException: Stream failed
> 
> I use the 3.9 default configuration with the cluster settings adjustments (3 
> seeds, GossipingPropertyFileSnitch).
> streaming_socket_timeout_in_ms is the default (8640).
> I'm afraid of consistency problems while I'm not performing repairs.
> Any ideas?
> Thanks,
> Nir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

