[jira] [Comment Edited] (CASSANDRA-10331) Establish and implement canonical bulk reading workload(s)

2016-03-13 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188861#comment-15188861
 ] 

Stefania edited comment on CASSANDRA-10331 at 3/14/16 3:02 AM:
---

In terms of establishing the canonical bulk reading workload, I suggest using 
the profiles available in [this 
benchmark|https://github.com/stef1927/cstar_bulk_read_test] with a single 
clustering column and row sizes of approximately 100 bytes, 1 KB and 10 KB. These 
correspond to the following user profiles:

* 
[bulk_read_100_bytes.yaml|https://github.com/stef1927/cstar_bulk_read_test/blob/master/bulk_read_100_bytes.yaml]
* 
[bulk_read_1_kbyte.yaml|https://github.com/stef1927/cstar_bulk_read_test/blob/master/bulk_read_1_kbyte.yaml]
* 
[bulk_read_10_kbytes.yaml|https://github.com/stef1927/cstar_bulk_read_test/blob/master/bulk_read_10_kbytes.yaml]

Here are sample cassandra-stress commands to generate the data and read it back:

{code}
cassandra-stress user profile=bulk_read_1_kbyte.yaml ops\(insert=1\) n=5M -rate threads=25
cassandra-stress user profile=bulk_read_1_kbyte.yaml ops\(all_columns_tr_query=1\) -rate threads=25
{code}
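
If a standalone client is also useful for comparison with cassandra-stress, below is a minimal sketch of a bulk-read client using the DataStax Java driver (3.x). The contact point and the {{ks.bulk_table}} keyspace/table names are placeholders for illustration, not the ones defined by the profiles above:

{code}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Minimal bulk-read client sketch (DataStax Java driver 3.x).
// "ks" and "bulk_table" are placeholders for the keyspace/table created by the stress profile.
public class BulkReadClient
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect())
        {
            Statement stmt = new SimpleStatement("SELECT * FROM ks.bulk_table");
            stmt.setFetchSize(5000);                    // page through the whole table

            long start = System.nanoTime();
            long rows = 0;
            ResultSet rs = session.execute(stmt);
            for (Row row : rs)                          // iterating fetches further pages on demand
                rows++;
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("Read %d rows in %d ms%n", rows, elapsedMs);
        }
    }
}
{code}

The fetch size is just a starting point; it can be varied to see how paging affects bulk-read throughput.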




was (Author: stefania):
In terms of establishing the canonical bulk reading workload, I suggest using 
the profiles available in [this 
benchmark|https://github.com/stef1927/cstar_bulk_read_test] with a single 
clustering column and row sizes of approximately 100 bytes, 1 KB and 10 KB. These 
correspond to the following user profiles:

* 
[bulk_read_value50_cluster1.yaml|https://github.com/stef1927/cstar_bulk_read_test/blob/master/bulk_read_value50_cluster1.yaml]
* 
[bulk_read_value500_cluster1.yaml|https://github.com/stef1927/cstar_bulk_read_test/blob/master/bulk_read_value500_cluster1.yaml]
* 
[bulk_read_value5k_cluster1.yaml|https://github.com/stef1927/cstar_bulk_read_test/blob/master/bulk_read_value5k_cluster1.yaml]

Here are sample cassandra-stress commands to generate the data and read it back:

{code}
cassandra-stress user profile=bulk_read_value50_cluster1.yaml ops\(insert=1\) n=5M -rate threads=25
cassandra-stress user profile=bulk_read_value50_cluster1.yaml ops\(all_columns_tr_query=1\) -rate threads=25
{code}



> Establish and implement canonical bulk reading workload(s)
> --
>
> Key: CASSANDRA-10331
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10331
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Ariel Weisberg
>Assignee: Stefania
>  Labels: docs-impacting
> Fix For: 3.0.5, 3.5
>
>
> Implement a client, use stress, or extend stress to a bulk reading workload 
> that is indicative of the performance we are trying to improve.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-13 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192380#comment-15192380
 ] 

Ariel Weisberg commented on CASSANDRA-11327:


I think that even if we only provided a mechanism, and the default policy was to 
maximize memtable utilization before applying backpressure, that would be a big 
improvement. We could then get feedback on what behavior people prefer.

It seems like the commit log should not be any more of a bottleneck than it is 
now. If the CL was able to go fast enough that it could fill up the memtables, 
then it should have enough capacity to do that indefinitely, since it doesn't 
really defer work like flushing or compaction.

Yes, it's off-topic, but I'll make sure it's copied over to an implementation 
ticket if we get there.

> Maintain a histogram of times when writes are blocked due to no available 
> memory
> 
>
> Key: CASSANDRA-11327
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11327
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Core
>Reporter: Ariel Weisberg
>
> I have a theory that part of the reason C* is so sensitive to timeouts during 
> saturating write load is that throughput is basically a sawtooth with valleys 
> at zero. This is something I have observed and it gets worse as you add 2i to 
> a table or do anything that decreases the throughput of flushing.
> I think the fix for this is to incrementally release memory pinned by 
> memtables and 2i during flushing instead of releasing it all at once. I know 
> that's not really possible, but we can fake it with memory accounting that 
> tracks how close to completion flushing is and releases permits for 
> additional memory. This will lead to a bit of a sawtooth in real memory 
> usage, but we can account for that so the peak footprint is the same.
> I think the end result of this change will be a sawtooth, but the valley of 
> the sawtooth will not be zero; it will be set by the rate at which flushing 
> progresses. Optimizing the rate at which flushing progresses and its 
> fairness with other work can then be tackled separately.
> Before we do this, I think we should demonstrate that memory pinned by 
> flushing is actually the issue, by getting better visibility into the 
> distribution of episodes where no memory is available: maintain a histogram 
> of the spans of time during which no memory is available and a thread is blocked.
> [MemtableAllocator$SubPool.allocate(long)|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java#L186]
>  should be a relatively straightforward entry point for this. The first 
> thread to block can mark the start of memory starvation and the last thread 
> out can mark the end. Have a periodic task that tracks the amount of time 
> spent blocked per interval of time and if it is greater than some threshold 
> log with more details, possibly at debug.
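
As a rough illustration of the "first thread in marks the start, last thread out marks the end" idea, here is a hypothetical sketch of a tracker that could be driven from the allocate path. The class and method names below are made up for illustration, not existing Cassandra code:

{code}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not existing Cassandra code: tracks how long writes are
// blocked waiting for memtable memory, per reporting interval.
public class BlockedMemoryTracker
{
    private final AtomicInteger blockedThreads = new AtomicInteger();
    private final AtomicLong starvationStartNanos = new AtomicLong();
    private final AtomicLong blockedNanosThisInterval = new AtomicLong();

    // Call just before a thread parks waiting for memory (e.g. from SubPool.allocate(long))
    public void markBlocked()
    {
        if (blockedThreads.getAndIncrement() == 0)
            starvationStartNanos.set(System.nanoTime());    // first thread in marks the start
    }

    // Call once the thread has obtained its memory again
    public void markUnblocked()
    {
        if (blockedThreads.decrementAndGet() == 0)          // last thread out marks the end
            blockedNanosThisInterval.addAndGet(System.nanoTime() - starvationStartNanos.get());
    }

    // Run from a periodic task, e.g. once per minute
    public void report(long thresholdNanos)
    {
        long blockedNanos = blockedNanosThisInterval.getAndSet(0);
        if (blockedNanos > thresholdNanos)
            System.out.printf("writes blocked on memtable memory for %d ms in the last interval%n",
                              blockedNanos / 1_000_000);
        // a real implementation would also record blockedNanos in a histogram and log at debug
    }
}
{code}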



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11310) Allow filtering on clustering columns for queries without secondary indexes

2016-03-13 Thread Alex Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-11310:

Attachment: (was: 
0001-Allow-filtering-on-clustering-columns-for-queries-wi.patch)

> Allow filtering on clustering columns for queries without secondary indexes
> ---
>
> Key: CASSANDRA-11310
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11310
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Alex Petrov
>  Labels: doc-impacting
> Fix For: 3.x
>
>
> Since CASSANDRA-6377 queries without index filtering non-primary key columns 
> are fully supported.
> It makes sense to also support filtering on clustering-columns.
> {code}
> CREATE TABLE emp_table2 (
> empID int,
> firstname text,
> lastname text,
> b_mon text,
> b_day text,
> b_yr text,
> PRIMARY KEY (empID, b_yr, b_mon, b_day));
> SELECT b_mon,b_day,b_yr,firstname,lastname FROM emp_table2
> WHERE b_mon='oct' ALLOW FILTERING;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11310) Allow filtering on clustering columns for queries without secondary indexes

2016-03-13 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189913#comment-15189913
 ] 

Alex Petrov edited comment on CASSANDRA-11310 at 3/13/16 1:59 PM:
--

I've attached a very rough version, mostly to understand whether it's generally 
going in the right direction. 

I've tried to avoid adding {{useFiltering}} to the 
{{PrimaryKeyRestrictionSet}}, although that would require exposing the 
{{Restrictions}} iterator, since the validations are different for queries that 
allow filtering. So I've left it "as-is" for now.

Other than that, I've tried to cover several types of queries in the tests. 

I'm still not sure whether the handling of multi-column restrictions is correct, 
since their {{SliceRestriction::addRowFilterTo}} is not permitted at the moment.

{{CONTAINS}} works as expected, too.

I've added the filtering for multicolumn slices, too: 
https://github.com/ifesdjeen/cassandra/commits/cassandra-11310


was (Author: ifesdjeen):
I've attached a very rough version, mostly to understand if it generally goes 
the right direction. 

I've tried to avoid adding {{useFiltering}} to the 
{{PrimaryKeyRestrictionSet}}, although that'd require exposing {{Restrictions}} 
iterator since the validations are different for queries that allow filtering. 
So I've left it "as-is" for now.

Other than that - I've tried to cover several types of queries in tests. 

I'm still not sure if the handling of multi-columns is correct, since their 
{{SliceRestriction::addRowFilterTo}} is not permitted at the moment.

{{CONTAINS}} works as expected, too.

I haven't yet run / tried the dtests; I'll hopefully do that tomorrow. 

> Allow filtering on clustering columns for queries without secondary indexes
> ---
>
> Key: CASSANDRA-11310
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11310
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Alex Petrov
>  Labels: doc-impacting
> Fix For: 3.x
>
>
> Since CASSANDRA-6377 queries without index filtering non-primary key columns 
> are fully supported.
> It makes sense to also support filtering on clustering-columns.
> {code}
> CREATE TABLE emp_table2 (
> empID int,
> firstname text,
> lastname text,
> b_mon text,
> b_day text,
> b_yr text,
> PRIMARY KEY (empID, b_yr, b_mon, b_day));
> SELECT b_mon,b_day,b_yr,firstname,lastname FROM emp_table2
> WHERE b_mon='oct' ALLOW FILTERING;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-03-13 Thread Fabien Rousseau (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabien Rousseau updated CASSANDRA-11349:

Description: 
We observed that repair, for some of our clusters, streamed a lot of data and 
many partitions were "out of sync".
Moreover, the read repair mismatch ratio is around 3% on those clusters, which 
is really high.

After investigation, it appears that, if two range tombstones exist for a 
partition for the same range/interval, they're both included in the Merkle tree 
computation.
But if, for some reason, the two range tombstones were already compacted into a 
single range tombstone on another node, this will result in a Merkle tree 
difference.
Currently, this is clearly bad because MerkleTree differences are dependent on 
compactions (and if a partition is deleted and created multiple times, the only 
way to ensure that repair "works correctly"/"doesn't overstream data" is to major 
compact before each repair... which is not really feasible).

Below are the steps to easily reproduce this case:
{noformat}
ccm create test -v 2.1.13 -n 2 -s
ccm node1 cqlsh
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
c1 text,
c2 text,
c3 float,
c4 float,
PRIMARY KEY ((c1), c2)
);
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
# now flush only one of the two nodes
ccm node1 flush 
ccm node1 cqlsh
USE test_rt;
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
ccm node1 repair
# now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
ccm node1 showlog | grep "out of sync"
{noformat}
The consequences are a costly repair, an accumulation of many small SSTables (up 
to thousands for a rather short period of time when using vnodes, until 
compaction absorbs those small files), and an increased size on disk.
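
To make the mechanism concrete, here is a toy sketch (not Cassandra's actual tombstone serialization or MerkleTree code) of why the digests diverge: hashing the same logical deletion as two un-compacted fragments on one node and as a single compacted fragment on another produces different digests, hence the mismatch and the overstreaming.

{code}
import java.security.MessageDigest;
import java.util.Arrays;

// Toy example only: the fragment strings stand in for serialized range tombstones.
public class TombstoneDigestExample
{
    static byte[] digest(String... fragments) throws Exception
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (String fragment : fragments)
            md.update(fragment.getBytes("UTF-8"));   // each on-disk fragment contributes to the hash
        return md.digest();
    }

    public static void main(String[] args) throws Exception
    {
        // Node 1: two range tombstones for the same interval, not yet compacted together
        byte[] node1 = digest("RT(a:b..a:b)@ts=1", "RT(a:b..a:b)@ts=2");
        // Node 2: the same deletions already compacted into a single covering tombstone
        byte[] node2 = digest("RT(a:b..a:b)@ts=2");
        // Same logical content, different digests => MerkleTree mismatch => overstreaming
        System.out.println(Arrays.equals(node1, node2));   // prints false
    }
}
{code}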


  was:
We observed that repair, for some of our clusters, streamed a lot of data and 
many partitions were "out of sync".
Moreover, the read repair mismatch ratio is around 3% on those clusters, which 
is really high.

After investigation, it appears that, if two range tombstones exist for a 
partition for the same range/interval, they're both included in the merkle tree 
computation.
But, if for some reason, on another node, the two range tombstones were already 
compacted into a single range tombstone, this will result in a merkle tree 
difference.
Currently, this is clearly bad because MerkleTree differences are dependent on 
compactions (and if a partition is deleted and created multiple times, the only 
way to ensure that repair "works correctly"/"doesn't overstream data" is to major 
compact before each repair... which is not really feasible).

Below is a list of steps allowing to easily reproduce this case:

ccm create test -v 2.1.13 -n 2 -s
ccm node1 cqlsh
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
c1 text,
c2 text,
c3 float,
c4 float,
PRIMARY KEY ((c1), c2)
);
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
# now flush only one of the two nodes
ccm node1 flush 
ccm node1 cqlsh
USE test_rt;
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
ccm node1 repair
# now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
ccm node1 showlog | grep "out of sync"

Consequences of this are a costly repair, accumulating many small SSTables (up 
to thousands for a rather short period of time when using VNodes, the time for 
compaction to absorb those small files), but also an increased size on disk.



> MerkleTree mismatch when multiple range tombstones exists for the same 
> partition and interval
> -
>
> Key: CASSANDRA-11349
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Fabien Rousseau
>
> We observed that repair, for some of our clusters, streamed a lot of data and 
> many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, 
> which is really high.
> After investigation, it appears that, if two range tombstones exist for a 
> partition for the same range/interval, they're both included in the merkle 
> tree computation.
> But, if for some reason, on 

[jira] [Created] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-03-13 Thread Fabien Rousseau (JIRA)
Fabien Rousseau created CASSANDRA-11349:
---

 Summary: MerkleTree mismatch when multiple range tombstones exists 
for the same partition and interval
 Key: CASSANDRA-11349
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
 Project: Cassandra
  Issue Type: Bug
Reporter: Fabien Rousseau


We observed that repair, for some of our clusters, streamed a lot of data and 
many partitions were "out of sync".
Moreover, the read repair mismatch ratio is around 3% on those clusters, which 
is really high.

After investigation, it appears that, if two range tombstones exist for a 
partition for the same range/interval, they're both included in the merkle tree 
computation.
But, if for some reason, on another node, the two range tombstones were already 
compacted into a single range tombstone, this will result in a merkle tree 
difference.
Currently, this is clearly bad because MerkleTree differences are dependent on 
compactions (and if a partition is deleted and created multiple times, the only 
way to ensure that repair "works correctly"/"doesn't overstream data" is to major 
compact before each repair... which is not really feasible).

Below are the steps to easily reproduce this case:

ccm create test -v 2.1.13 -n 2 -s
ccm node1 cqlsh
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
c1 text,
c2 text,
c3 float,
c4 float,
PRIMARY KEY ((c1), c2)
);
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
# now flush only one of the two nodes
ccm node1 flush 
ccm node1 cqlsh
USE test_rt;
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
ccm node1 repair
# now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
ccm node1 showlog | grep "out of sync"

Consequences of this are a costly repair, accumulating many small SSTables (up 
to thousands for a rather short period of time when using VNodes, the time for 
compaction to absorb those small files), but also an increased size on disk.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)