Re: Keyspace and table/cf limits

2014-12-03 Thread Nikolai Grigoriev
We had a similar problem - multi-tenancy and multiple DC support. But we
did not really have a strict requirement of one keyspace per tenant; our row
keys allow us to put any number of tenants in a single keyspace.

So, on one hand, we could put the data for all tenants in a single keyspace
and size the cluster for it - in the end, the total amount of data would be
the same :)

However, we wanted different replication strategies for different customers,
and the replication strategy is a keyspace setting. Thus, it would be simpler
to have one keyspace per customer.

The cost, as was mentioned, is per CF: the more keyspaces we have, the more
CFs we have, so we did not want that number to get too high.

The decision we've made was something in between. We would define a number of
keyspaces with different replication strategies (possibly even duplicate
ones) and map tenants to these keyspaces. Thus, there would be a few tenants
per keyspace, all sharing the same properties (the replication strategy, in
our case). We could even create a keyspace that groups tenants that currently
share the same replication requirements and that may be moved/replicated to a
specific DC in the future.
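
To illustrate - a rough sketch only (the keyspace names, tenant ids and
replication settings below are made up; the CQL is issued through the Java
driver):

import com.datastax.driver.core.Session;
import java.util.HashMap;
import java.util.Map;

public class TenantKeyspaceRouter {

    // Hypothetical "profile" keyspaces: each one groups tenants that share the
    // same replication requirements, instead of one keyspace per tenant.
    private final Map<String, String> tenantToKeyspace = new HashMap<String, String>();

    public TenantKeyspaceRouter(Session session) {
        session.execute("CREATE KEYSPACE IF NOT EXISTS tenants_dc1_only WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3}");
        session.execute("CREATE KEYSPACE IF NOT EXISTS tenants_dc1_dc2 WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");

        // Tenant-to-keyspace mapping; row keys still carry the tenant id inside the keyspace.
        tenantToKeyspace.put("tenant-a", "tenants_dc1_dc2");
        tenantToKeyspace.put("tenant-b", "tenants_dc1_only");
    }

    public String keyspaceFor(String tenantId) {
        return tenantToKeyspace.get(tenantId);
    }
}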

On Wed, Dec 3, 2014 at 4:54 PM, Raj N raj.cassan...@gmail.com wrote:

 The question is more from a multi-tenancy point of view. We wanted to see
 if we can have a keyspace per client. Each keyspace may have 50 column
 families, but if we have 200 clients, that would be 10,000 column families.
 Do you think that's reasonable to support? I know that key cache capacity
 is reserved in heap still. Any plans to move it off-heap?

 -Raj

 On Tue, Nov 25, 2014 at 3:10 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Nov 25, 2014 at 9:07 AM, Raj N raj.cassan...@gmail.com wrote:

 What's the latest on the maximum number of keyspaces and/or tables that
 one can have in Cassandra 2.1.x?


 Most relevant changes lately would be :

 https://issues.apache.org/jira/browse/CASSANDRA-6689
 and
 https://issues.apache.org/jira/browse/CASSANDRA-6694

 Which should meaningfully reduce the amount of heap memtables consume.
 That heap can then be used to support more heap-persistent structures
 associated with many CFs. I have no idea how to estimate the scale of the
 improvement.

 As a general/meta statement, Cassandra is very multi-threaded, and
 consumes file handles like crazy. How many different query cases do you
 really want to put on one cluster/node? ;D

 =Rob





-- 
Nikolai Grigoriev
(514) 772-5178


Re: opscenter: 0 of 0 agents connected, but /nodes/all gives 3 results

2014-12-02 Thread Nikolai Grigoriev
I have observed this kind of situation with 0 agents connected. So far,
restarting the agents has always helped. Also, check the agent logs and the
opscenterd logs - there may be some clues there.

On Tue, Dec 2, 2014 at 4:59 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi all -

 Just getting started setting up OpsCenter today.  I have a 3 node
 cassandra cluster and (afaict) the agent installed and running happily on
 all 3 nodes.  I also have OpsCenter up and running on a 4th node.  I do not
 have SSL enabled between these nodes.

 In the OpsCenter interface, I see 0 of 0 agents connected and restarting
 OpsCenter does not change this.  However, in my cluster settings I have the
 following entered for the 'at least one host / IP in the cluster' portion, and
 these hostnames are all resolvable/pingable from the OpsCenter machine.

 cassandra-db-1
 cassandra-db-2
 cassandra-db-3

 Also, from this post
 http://www.datastax.com/support-forums/topic/agents-running-without-error-but-not-connecting-to-opscenter
 I found a suggestion to curl cluster/nodes/all, and when I do that I (think I)
 get results showing 3 nodes.

 Any suggestions?  Thanks!
 - Ian


 curl http://localhost:/FullStory/nodes/all


 [{"load": 0.0, "has_jna": false, "vnodes": true,
   "devices": {"saved_caches": "sdb", "commitlog": "sdb", "other": ["sda"], "data": ["sdb"]},
   "task_progress": {}, "node_ip": "10.240.167.187", "network_interfaces": ["eth0", "lo"],
   "ec2": {"instance-type": null, "placement": null, "ami-id": null, "instance-id": null},
   "node_version": {"search": null, "jobtracker": null, "tasktracker": null,
                    "spark": {"master": null, "version": null, "worker": null},
                    "dse": null, "cassandra": "2.0.9"},
   "dc": "us-central1", "node_name": "cassandra-db-3.c.fs-staging.internal",
   "num_procs": 1, "streaming": {}, "token": -3220529141950148793,
   "data_held": 54064133.0, "mode": "normal", "rpc_ip": "10.240.167.187",
   "partitions": {"saved_caches": "/dev/sdb", "commitlog": "/dev/sdb",
                  "other": ["/dev/disk/by_uuid/121fd70f_e625_4394_89ce_5e5eff3d25e0", "rootfs"],
                  "data": ["/dev/sdb"]},
   "os": "linux", "rack": "f", "last_seen": 0},
  {"load": 0.0, "has_jna": false, "vnodes": true,
   "devices": {"saved_caches": "sdb", "commitlog": "sdb", "other": ["sda"], "data": ["sdb"]},
   "task_progress": {}, "node_ip": "10.240.111.79", "network_interfaces": ["eth0", "lo"],
   "ec2": {"instance-type": null, "placement": null, "ami-id": null, "instance-id": null},
   "node_version": {"search": null, "jobtracker": null, "tasktracker": null,
                    "spark": {"master": null, "version": null, "worker": null},
                    "dse": null, "cassandra": "2.0.9"},
   "dc": "us-central1", "node_name": "cassandra-db-1.c.fs-staging.internal",
   "num_procs": 1, "streaming": {}, "token": 6048217978730786970,
   "data_held": 54319361.0, "mode": "normal", "rpc_ip": "10.240.111.79",
   "partitions": {"saved_caches": "/dev/sdb", "commitlog": "/dev/sdb",
                  "other": ["/dev/disk/by_uuid/2795e34f_4e9b_46a9_9e39_e6cc0f1f1478", "rootfs"],
                  "data": ["/dev/sdb"]},
   "os": "linux", "rack": "a", "last_seen": 0},
  {"load": 0.04, "has_jna": false, "vnodes": true,
   "devices": {"saved_caches": "sdb", "commitlog": "sdb", "other": ["sda"], "data": ["sdb"]},
   "task_progress": {}, "node_ip": "10.240.129.185", "network_interfaces": ["eth0", "lo"],
   "ec2": {"instance-type": null, "placement": null, "ami-id": null, "instance-id": null},
   "node_version": {"search": null, "jobtracker": null, "tasktracker": null,
                    "spark": {"master": null, "version": null, "worker": null},
                    "dse": null, "cassandra": "2.0.9"},
   "dc": "us-central1", "node_name": "cassandra-db-2.c.fs-staging.internal",
   "num_procs": 1, "streaming": {}, "token": 4393271482730462700,
   "data_held": 51388793.0, "mode": "normal", "rpc_ip": "10.240.129.185",
   "partitions": {"saved_caches": "/dev/sdb", "commitlog": "/dev/sdb",
                  "other": ["/dev/disk/by_uuid/2795e34f_4e9b_46a9_9e39_e6cc0f1f1478", "rootfs"],
                  "data": ["/dev/sdb"]},
   "os": "linux", "rack": "b", "last_seen": 0}]




-- 
Nikolai Grigoriev
(514) 772-5178


Re: does safe cassandra shutdown require disable binary?

2014-12-01 Thread Nikolai Grigoriev
Personally, I believe you do not have to do these steps just to perform a
restart. I know the node will start faster if it is drained before shutdown,
but in my experience these steps make the overall restart (stop + start)
slightly longer. So if it is really just a rolling restart to apply some JVM
or C* settings, I would simply kill the process and start it again.

On Mon, Dec 1, 2014 at 12:02 AM, Kevin Burton bur...@spinn3r.com wrote:

 I’m trying to figure out a safe way to do a rolling restart.

 http://devblog.michalski.im/2012/11/25/safe-cassandra-shutdown-and-restart/

 It has the following command which make sense:

 root@cssa01:~# nodetool -h cssa01.michalski.im disablegossip
 root@cssa01:~# nodetool -h cssa01.michalski.im disablethrift
 root@cssa01:~# nodetool -h cssa01.michalski.im drain


 … but I don’t think this takes into consideration CQL.


 So you would first disablethrift, then disablebinary


 anything else needed in modern Cassandra ?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




-- 
Nikolai Grigoriev
(514) 772-5178


Re: Compaction Strategy guidance

2014-11-25 Thread Nikolai Grigoriev
Hi Jean-Armel,

I am using the latest and greatest DSE 4.5.2 (4.5.3 in another cluster, but
there are no relevant changes between 4.5.2 and 4.5.3) - thus, Cassandra
2.0.10.

I have about 1.8 Tb of data per node now in total, which falls into that
range.

As I said, it is really a problem with a large amount of data in a single CF,
not the total amount of data. Quite often the nodes are idle yet have quite a
few pending compactions. I have discussed it with other members of the C*
community and with DataStax people, and they have confirmed my observation.

I believe that increasing the sstable size won't help at all and will probably
make things worse - everything else being equal, of course. But I would like
to hear from Andrei when he is done with his test.

Regarding the last statement - yes, C* clearly likes many small servers more
than fewer large ones. But it is all relative - and it can all be recalculated
into $$$ :) C* is all about partitioning everything - storage, traffic... Less
data per node and more nodes give you lower latency, lower heap usage, etc. I
think I have learned this with my project - the somewhat hard way, but nothing
beats personal experience :)

On Tue, Nov 25, 2014 at 3:23 AM, Jean-Armel Luce jaluc...@gmail.com wrote:

 Hi Andrei, Hi Nicolai,

 Which version of C* are you using ?

 There are some recommendations about the max storage per node :
 http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2

 For 1.0 we recommend 300-500GB. For 1.2 we are looking to be able to
 handle 10x
 (3-5TB).

 I have the feeling that those recommendations are sensitive according many
 criteria such as :
 - your hardware
 - the compaction strategy
 - ...

 It looks that LCS lower those limitations.

 Increasing the size of sstables might help if you have enough CPU and you
 can put more load on your I/O system (@Andrei, I am interested by the
 results of your  experimentation about large sstable files)

 From my point of view, there are some usage patterns where it is better to
 have many small servers than a few large servers. Probably, it is better to
 have many small servers if you need LCS for large tables.

 Just my 2 cents.

 Jean-Armel

 2014-11-24 19:56 GMT+01:00 Robert Coli rc...@eventbrite.com:

 On Mon, Nov 24, 2014 at 6:48 AM, Nikolai Grigoriev ngrigor...@gmail.com
 wrote:

 One of the obvious recommendations I have received was to run more than
 one instance of C* per host. Makes sense - it will reduce the amount of
 data per node and will make better use of the resources.


 This is usually a Bad Idea to do in production.

 =Rob






-- 
Nikolai Grigoriev
(514) 772-5178


Re: Rule of thumb for concurrent asynchronous queries?

2014-11-25 Thread Nikolai Grigoriev
I think it all depends on how many machines will be involved in the query
(read consistency is also a factor) and how large a typical response is in
bytes. Large responses put more pressure on the GC, which results in more time
spent in GC and possibly long(er) GC pauses.

Cassandra can tolerate many things - but at a cost to other queries, and
ultimately to the health of the individual node.

From the original question it is not clear whether all these rows come from
the same node or a few nodes (one token range), or whether they are really 10K
distinct primary keys spread more or less evenly across the cluster.

Also, node disk I/O may be a concern - especially if the data is not in the OS
cache (or the row cache, if applicable).

It is a tough question to answer precisely. If I had such a problem, I would
first try to determine the peak throughput I can achieve: find the limiting
factor (most likely CPU or disk I/O) by firing as many requests in as many
threads as is practical for the client app. Measure the load to prove that you
have identified the limiting factor correctly (either CPU or I/O; I doubt it
will be the network). Then measure the latency, decide what kind of latency
you can tolerate for your use case, and back off from that peak load by a
certain factor (i.e. limit yourself to XX% of the peak load you have achieved).
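
As a rough sketch of the throttling this implies - assuming the DataStax Java
driver 2.x with Guava on the classpath; the keyspace/table, the query and the
128-permit cap are placeholders to be tuned against the measured peak:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import java.util.List;
import java.util.concurrent.Semaphore;

public class ThrottledReader {

    // Cap on in-flight asynchronous queries - tune against the measured peak load.
    private final Semaphore permits = new Semaphore(128);
    private final Session session;

    public ThrottledReader(Session session) {
        this.session = session;
    }

    public void fetchAll(List<String> keys) throws InterruptedException {
        for (String key : keys) {
            permits.acquire();   // blocks once too many queries are in flight
            ResultSetFuture future = session.executeAsync(
                    "SELECT * FROM my_keyspace.my_table WHERE id = ?", key);
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                @Override public void onSuccess(ResultSet rs) { permits.release(); /* consume rs */ }
                @Override public void onFailure(Throwable t) { permits.release(); /* log or retry */ }
            });
        }
    }
}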

On Tue, Nov 25, 2014 at 11:34 AM, Jack Krupansky j...@basetechnology.com
wrote:

 Great question. The safe answer is to do a proof of concept implementation
 and try various rates to determine where the bottleneck is. It will also
 depend on the row size. Hard to say if you will be limited by the cluster
 load or network bandwidth.

 Is there only one client talking to your cluster? Or are you asking what
 each of, say, one million clients can be simultaneously requesting?

 The rate of requests will matter as well, particularly if the cluster has
 a non-trivial load.

 My ultimate rule of thumb is simple: Moderation. Not too many threads, not
 too frequent request rate.

 It would be nice if we had a way to calculate this number (both numbers)
 for you so that a client (driver) could ping for it from the cluster, as
 well as for the cluster to return a suggested wait interval before sending
 another request based on actual load.

 -- Jack Krupansky

 -Original Message- From: Robert Wille
 Sent: Tuesday, November 25, 2014 10:57 AM
 To: user@cassandra.apache.org
 Subject: Rule of thumb for concurrent asynchronous queries?

 Suppose I have the primary keys for 10,000 rows and I want them all. Is
 there a rule of thumb for the maximum number of concurrent asynchronous
 queries I should execute?




-- 
Nikolai Grigoriev
(514) 772-5178


Re: Compaction Strategy guidance

2014-11-25 Thread Nikolai Grigoriev
Andrei,

Oh, yes, I have scanned the top of your previous email but overlooked the
last part.

I am using SSDs, so I prefer to put in the extra work to keep my system
performing well and to save expensive disk space. So far I have been able to
size the system more or less correctly, so these LCS limitations do not cause
too much trouble. But I do keep the CF sharding option as a backup - for me it
would be relatively easy to implement.

On Tue, Nov 25, 2014 at 1:25 PM, Andrei Ivanov aiva...@iponweb.net wrote:

 Nikolai,

 Just in case you've missed my comment in the thread (guess you have) -
 increasing sstable size does nothing (in our case at least). That is,
 it's not worse but the load pattern is still the same - doing nothing
 most of the time. So, I switched to STCS and we will have to live with
 extra storage cost - storage is way cheaper than cpu etc anyhow:-)

 On Tue, Nov 25, 2014 at 5:53 PM, Nikolai Grigoriev ngrigor...@gmail.com
 wrote:
  Hi Jean-Armel,
 
  I am using latest and greatest DSE 4.5.2 (4.5.3 in another cluster but
 there
  are no relevant changes between 4.5.2 and 4.5.3) - thus, Cassandra
 2.0.10.
 
  I have about 1,8Tb of data per node now in total, which falls into that
  range.
 
  As I said, it is really a problem with large amount of data in a single
 CF,
  not total amount of data. Quite often the nodes are idle yet having
 quite a
  bit of pending compactions. I have discussed it with other members of C*
  community and DataStax guys and, they have confirmed my observation.
 
  I believe that increasing the sstable size won't help at all and probably
  will make the things worse - everything else being equal, of course. But
 I
  would like to hear from Andrei when he is done with his test.
 
  Regarding the last statement - yes, C* clearly likes many small servers
 more
  than fewer large ones. But it is all relative - and can be all
 recalculated
  to $$$ :) C* is all about partitioning of everything - storage,
  traffic...Less data per node and more nodes give you lower latency, lower
  heap usage etc, etc. I think I have learned this with my project.
 Somewhat
  hard way but still, nothing is better than the personal experience :)
 
  On Tue, Nov 25, 2014 at 3:23 AM, Jean-Armel Luce jaluc...@gmail.com
 wrote:
 
  Hi Andrei, Hi Nicolai,
 
  Which version of C* are you using ?
 
  There are some recommendations about the max storage per node :
 
 http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2
 
  For 1.0 we recommend 300-500GB. For 1.2 we are looking to be able to
  handle 10x
  (3-5TB).
 
  I have the feeling that those recommendations are sensitive according
 many
  criteria such as :
  - your hardware
  - the compaction strategy
  - ...
 
  It looks that LCS lower those limitations.
 
  Increasing the size of sstables might help if you have enough CPU and
 you
  can put more load on your I/O system (@Andrei, I am interested by the
  results of your  experimentation about large sstable files)
 
  From my point of view, there are some usage patterns where it is better
 to
  have many small servers than a few large servers. Probably, it is
 better to
  have many small servers if you need LCS for large tables.
 
  Just my 2 cents.
 
  Jean-Armel
 
  2014-11-24 19:56 GMT+01:00 Robert Coli rc...@eventbrite.com:
 
  On Mon, Nov 24, 2014 at 6:48 AM, Nikolai Grigoriev 
 ngrigor...@gmail.com
  wrote:
 
  One of the obvious recommendations I have received was to run more
 than
  one instance of C* per host. Makes sense - it will reduce the amount
 of data
  per node and will make better use of the resources.
 
 
  This is usually a Bad Idea to do in production.
 
  =Rob
 
 
 
 
 
 
  --
  Nikolai Grigoriev
  (514) 772-5178




-- 
Nikolai Grigoriev
(514) 772-5178


Re: Compaction Strategy guidance

2014-11-24 Thread Nikolai Grigoriev
Jean-Armel,

I have only two large tables; the rest are super-small. In the test cluster
of 15 nodes, the largest table has about 110M rows. Its total size is about
1.26Gb per node (total disk space used per node for that CF). It's got about
5K sstables per node - the sstable size is 256Mb. cfstats on a healthy node
looks like this:

Read Count: 8973748
Read Latency: 16.130059053251774 ms.
Write Count: 32099455
Write Latency: 1.6124713938912671 ms.
Pending Tasks: 0
Table: wm_contacts
SSTable count: 5195
SSTables in each level: [27/4, 11/10, 104/100, 1053/1000, 4000, 0,
0, 0, 0]
Space used (live), bytes: 1266060391852
Space used (total), bytes: 1266144170869
SSTable Compression Ratio: 0.32604853410787327
Number of keys (estimate): 25696000
Memtable cell count: 71402
Memtable data size, bytes: 26938402
Memtable switch count: 9489
Local read count: 8973748
Local read latency: 17.696 ms
Local write count: 32099471
Local write latency: 1.732 ms
Pending tasks: 0
Bloom filter false positives: 32248
Bloom filter false ratio: 0.50685
Bloom filter space used, bytes: 20744432
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 3379391
Compacted partition mean bytes: 172660
Average live cells per slice (last five minutes): 495.0
Average tombstones per slice (last five minutes): 0.0

Another table of similar structure (same number of rows) is about 4x smaller.
That table does not suffer from these issues - it compacts well and
efficiently.

On Mon, Nov 24, 2014 at 2:30 AM, Jean-Armel Luce jaluc...@gmail.com wrote:

 Hi Nikolai,

 Please could you clarify a little bit what you call a large amount of
 data ?

 How many tables ?
 How many rows in your largest table ?
 How many GB in your largest table ?
 How many GB per node ?

 Thanks.



 2014-11-24 8:27 GMT+01:00 Jean-Armel Luce jaluc...@gmail.com:

 Hi Nikolai,

 Thanks for those informations.

 Please could you clarify a little bit what you call 

 2014-11-24 4:37 GMT+01:00 Nikolai Grigoriev ngrigor...@gmail.com:

 Just to clarify - when I was talking about the large amount of data I
 really meant large amount of data per node in a single CF (table). LCS does
 not seem to like it when it gets thousands of sstables (makes 4-5 levels).

 When bootstraping a new node you'd better enable that option from
 CASSANDRA-6621 (the one that disables STCS in L0). But it will still be a
 mess - I have a node that I have bootstrapped ~2 weeks ago. Initially it
 had 7,5K pending compactions, now it has almost stabilized ad 4,6K. Does
 not go down. Number of sstables at L0  is over 11K and it is slowly slowly
 building upper levels. Total number of sstables is 4x the normal amount.
 Now I am not entirely sure if this node will ever get back to normal life.
 And believe me - this is not because of I/O, I have SSDs everywhere and 16
 physical cores. This machine is barely using 1-3 cores at most of the time.
 The problem is that allowing STCS fallback is not a good option either - it
 will quickly result in a few 200Gb+ sstables in my configuration and then
 these sstables will never be compacted. Plus, it will require close to 2x
 disk space on EVERY disk in my JBOD configuration...this will kill the node
 sooner or later. This is all because all sstables after bootstrap end at L0
 and then the process slowly slowly moves them to other levels. If you have
 write traffic to that CF then the number of sstables and L0 will grow
 quickly - like it happens in my case now.

 Once something like https://issues.apache.org/jira/browse/CASSANDRA-8301
 is implemented it may be better.


 On Sun, Nov 23, 2014 at 4:53 AM, Andrei Ivanov aiva...@iponweb.net
 wrote:

 Stephane,

 We are having a somewhat similar C* load profile. Hence some comments
 in addition Nikolai's answer.
 1. Fallback to STCS - you can disable it actually
 2. Based on our experience, if you have a lot of data per node, LCS
 may work just fine. That is, till the moment you decide to join
 another node - chances are that the newly added node will not be able
 to compact what it gets from old nodes. In your case, if you switch
 strategy the same thing may happen. This is all due to limitations
 mentioned by Nikolai.

 Andrei,


 On Sun, Nov 23, 2014 at 8:51 AM, Servando Muñoz G. smg...@gmail.com
 wrote:
  ABUSE
 
 
 
  YA NO QUIERO MAS MAILS SOY DE MEXICO
 
 
 
  De: Nikolai Grigoriev [mailto:ngrigor...@gmail.com]
  Enviado el: sábado, 22 de noviembre de 2014 07:13 p. m.
  Para: user@cassandra.apache.org
  Asunto: Re: Compaction Strategy guidance
  Importancia: Alta
 
 
 
  Stephane,
 
  As everything good, LCS comes at certain price.
 
  LCS will put most load on you I/O system (if you use spindles - you
 may need
  to be careful about that) and on CPU. Also LCS (by default) may fall
 back to
  STCS

Re: Compaction Strategy guidance

2014-11-24 Thread Nikolai Grigoriev
Andrei,

Oh, Monday mornings...Tb :)

On Mon, Nov 24, 2014 at 9:12 AM, Andrei Ivanov aiva...@iponweb.net wrote:

 Nikolai,

 Are you sure about 1.26Gb? Like it doesn't look right - 5195 tables
 with 256Mb table size...

 Andrei

 On Mon, Nov 24, 2014 at 5:09 PM, Nikolai Grigoriev ngrigor...@gmail.com
 wrote:
  Jean-Armel,
 
  I have only two large tables, the rest is super-small. In the test
 cluster
  of 15 nodes the largest table has about 110M rows. Its total size is
 about
  1,26Gb per node (total disk space used per node for that CF). It's got
 about
  5K sstables per node - the sstable size is 256Mb. cfstats on a healthy
  node look like this:
 
  Read Count: 8973748
  Read Latency: 16.130059053251774 ms.
  Write Count: 32099455
  Write Latency: 1.6124713938912671 ms.
  Pending Tasks: 0
  Table: wm_contacts
  SSTable count: 5195
  SSTables in each level: [27/4, 11/10, 104/100, 1053/1000, 4000,
 0,
  0, 0, 0]
  Space used (live), bytes: 1266060391852
  Space used (total), bytes: 1266144170869
  SSTable Compression Ratio: 0.32604853410787327
  Number of keys (estimate): 25696000
  Memtable cell count: 71402
  Memtable data size, bytes: 26938402
  Memtable switch count: 9489
  Local read count: 8973748
  Local read latency: 17.696 ms
  Local write count: 32099471
  Local write latency: 1.732 ms
  Pending tasks: 0
  Bloom filter false positives: 32248
  Bloom filter false ratio: 0.50685
  Bloom filter space used, bytes: 20744432
  Compacted partition minimum bytes: 104
  Compacted partition maximum bytes: 3379391
  Compacted partition mean bytes: 172660
  Average live cells per slice (last five minutes): 495.0
  Average tombstones per slice (last five minutes): 0.0
 
  Another table of similar structure (same number of rows) is about 4x
 times
  smaller. That table does not suffer from those issues - it compacts well
 and
  efficiently.
 
  On Mon, Nov 24, 2014 at 2:30 AM, Jean-Armel Luce jaluc...@gmail.com
 wrote:
 
  Hi Nikolai,
 
  Please could you clarify a little bit what you call a large amount of
  data ?
 
  How many tables ?
  How many rows in your largest table ?
  How many GB in your largest table ?
  How many GB per node ?
 
  Thanks.
 
 
 
  2014-11-24 8:27 GMT+01:00 Jean-Armel Luce jaluc...@gmail.com:
 
  Hi Nikolai,
 
  Thanks for those informations.
 
  Please could you clarify a little bit what you call 
 
  2014-11-24 4:37 GMT+01:00 Nikolai Grigoriev ngrigor...@gmail.com:
 
  Just to clarify - when I was talking about the large amount of data I
  really meant large amount of data per node in a single CF (table).
 LCS does
  not seem to like it when it gets thousands of sstables (makes 4-5
 levels).
 
  When bootstraping a new node you'd better enable that option from
  CASSANDRA-6621 (the one that disables STCS in L0). But it will still
 be a
  mess - I have a node that I have bootstrapped ~2 weeks ago. Initially
 it had
  7,5K pending compactions, now it has almost stabilized ad 4,6K. Does
 not go
  down. Number of sstables at L0  is over 11K and it is slowly slowly
 building
  upper levels. Total number of sstables is 4x the normal amount. Now I
 am not
  entirely sure if this node will ever get back to normal life. And
 believe me
  - this is not because of I/O, I have SSDs everywhere and 16 physical
 cores.
  This machine is barely using 1-3 cores at most of the time. The
 problem is
  that allowing STCS fallback is not a good option either - it will
 quickly
  result in a few 200Gb+ sstables in my configuration and then these
 sstables
  will never be compacted. Plus, it will require close to 2x disk space
 on
  EVERY disk in my JBOD configuration...this will kill the node sooner
 or
  later. This is all because all sstables after bootstrap end at L0 and
 then
  the process slowly slowly moves them to other levels. If you have
 write
  traffic to that CF then the number of sstables and L0 will grow
 quickly -
  like it happens in my case now.
 
  Once something like
 https://issues.apache.org/jira/browse/CASSANDRA-8301
  is implemented it may be better.
 
 
  On Sun, Nov 23, 2014 at 4:53 AM, Andrei Ivanov aiva...@iponweb.net
  wrote:
 
  Stephane,
 
  We are having a somewhat similar C* load profile. Hence some comments
  in addition Nikolai's answer.
  1. Fallback to STCS - you can disable it actually
  2. Based on our experience, if you have a lot of data per node, LCS
  may work just fine. That is, till the moment you decide to join
  another node - chances are that the newly added node will not be able
  to compact what it gets from old nodes. In your case, if you switch
  strategy the same thing may happen. This is all due to limitations
  mentioned by Nikolai.
 
  Andrei,
 
 
  On Sun, Nov 23, 2014 at 8:51 AM, Servando Muñoz G. smg...@gmail.com
 
  wrote:
   ABUSE

Re: Compaction Strategy guidance

2014-11-24 Thread Nikolai Grigoriev
I was thinking about that option and I would be curious to find out how this
change helps you. I suspect that increasing the sstable size won't help much,
because the compaction throughput (per task/thread) is still the same - it
will simply take 4x longer to finish a compaction task. It is possible that,
because of this, the CPU will be under-used for even longer.

My data model, unfortunately, requires this amount of data. And I suspect
that regardless of how it is organized I won't be able to optimize it - I do
need this data to be in one row so I can read it quickly.

One of the obvious recommendations I have received was to run more than one
instance of C* per host. Makes sense - it would reduce the amount of data per
node and make better use of the resources. I would go for it myself, but it
may be a challenge for the people in operations. Without a VM it would be
trickier for them to operate such a setup, and I do not want any VMs there.

Another option is probably to simply shard my data across several identical
tables in the same keyspace. I could also think about different keyspaces,
but I prefer not to spread the data of the same logical tenant across
multiple keyspaces. Take my primary key's hash, do something like mod 4, and
append the result to the table name :) This would effectively reduce the
number of sstables and the amount of data per table (CF). I kind of like this
idea more - yes, a bit more challenge at the coding level, but obvious
benefits without extra operational complexity.
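
A hedged sketch of that mod-4 idea (the shard count, base table name and the
hash choice are purely illustrative):

public class TableSharding {

    private static final int SHARDS = 4;                    // e.g. wm_contacts_0 .. wm_contacts_3
    private static final String BASE_TABLE = "wm_contacts";

    /** Picks the physical table for a given partition key. */
    public static String tableFor(String partitionKey) {
        // hashCode() % 4 is in the range -3..3, so Math.abs() is safe here.
        int bucket = Math.abs(partitionKey.hashCode() % SHARDS);
        return BASE_TABLE + "_" + bucket;
    }

    public static void main(String[] args) {
        // Prints one of wm_contacts_0 .. wm_contacts_3 for the given key.
        System.out.println(tableFor("tenant42:contact:12345"));
    }
}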


On Mon, Nov 24, 2014 at 9:32 AM, Andrei Ivanov aiva...@iponweb.net wrote:

 Nikolai,

 This is more or less what I'm seeing on my cluster then. Trying to
 switch to bigger sstables right now (1Gb)

 On Mon, Nov 24, 2014 at 5:18 PM, Nikolai Grigoriev ngrigor...@gmail.com
 wrote:
  Andrei,
 
  Oh, Monday mornings...Tb :)
 
  On Mon, Nov 24, 2014 at 9:12 AM, Andrei Ivanov aiva...@iponweb.net
 wrote:
 
  Nikolai,
 
  Are you sure about 1.26Gb? Like it doesn't look right - 5195 tables
  with 256Mb table size...
 
  Andrei
 
  On Mon, Nov 24, 2014 at 5:09 PM, Nikolai Grigoriev 
 ngrigor...@gmail.com
  wrote:
   Jean-Armel,
  
   I have only two large tables, the rest is super-small. In the test
   cluster
   of 15 nodes the largest table has about 110M rows. Its total size is
   about
   1,26Gb per node (total disk space used per node for that CF). It's got
   about
   5K sstables per node - the sstable size is 256Mb. cfstats on a
 healthy
   node look like this:
  
   Read Count: 8973748
   Read Latency: 16.130059053251774 ms.
   Write Count: 32099455
   Write Latency: 1.6124713938912671 ms.
   Pending Tasks: 0
   Table: wm_contacts
   SSTable count: 5195
   SSTables in each level: [27/4, 11/10, 104/100, 1053/1000,
 4000,
   0,
   0, 0, 0]
   Space used (live), bytes: 1266060391852
   Space used (total), bytes: 1266144170869
   SSTable Compression Ratio: 0.32604853410787327
   Number of keys (estimate): 25696000
   Memtable cell count: 71402
   Memtable data size, bytes: 26938402
   Memtable switch count: 9489
   Local read count: 8973748
   Local read latency: 17.696 ms
   Local write count: 32099471
   Local write latency: 1.732 ms
   Pending tasks: 0
   Bloom filter false positives: 32248
   Bloom filter false ratio: 0.50685
   Bloom filter space used, bytes: 20744432
   Compacted partition minimum bytes: 104
   Compacted partition maximum bytes: 3379391
   Compacted partition mean bytes: 172660
   Average live cells per slice (last five minutes): 495.0
   Average tombstones per slice (last five minutes): 0.0
  
   Another table of similar structure (same number of rows) is about 4x
   times
   smaller. That table does not suffer from those issues - it compacts
 well
   and
   efficiently.
  
   On Mon, Nov 24, 2014 at 2:30 AM, Jean-Armel Luce jaluc...@gmail.com
   wrote:
  
   Hi Nikolai,
  
   Please could you clarify a little bit what you call a large amount
 of
   data ?
  
   How many tables ?
   How many rows in your largest table ?
   How many GB in your largest table ?
   How many GB per node ?
  
   Thanks.
  
  
  
   2014-11-24 8:27 GMT+01:00 Jean-Armel Luce jaluc...@gmail.com:
  
   Hi Nikolai,
  
   Thanks for those informations.
  
   Please could you clarify a little bit what you call 
  
   2014-11-24 4:37 GMT+01:00 Nikolai Grigoriev ngrigor...@gmail.com:
  
   Just to clarify - when I was talking about the large amount of
 data I
   really meant large amount of data per node in a single CF (table).
   LCS does
   not seem to like it when it gets thousands of sstables (makes 4-5
   levels).
  
   When bootstraping a new node you'd better enable that option from
   CASSANDRA-6621 (the one that disables STCS in L0). But it will
 still
   be a
   mess - I have a node that I have

Re: Compaction Strategy guidance

2014-11-23 Thread Nikolai Grigoriev
Just to clarify - when I was talking about a large amount of data, I really
meant a large amount of data per node in a single CF (table). LCS does not
seem to like it when it gets thousands of sstables (making 4-5 levels).

When bootstrapping a new node you'd better enable the option from
CASSANDRA-6621 (the one that disables STCS in L0). But it will still be a
mess - I have a node that I bootstrapped ~2 weeks ago. Initially it had 7.5K
pending compactions; now it has almost stabilized at 4.6K and does not go
down. The number of sstables at L0 is over 11K and it is slowly, slowly
building the upper levels. The total number of sstables is 4x the normal
amount. Now I am not entirely sure this node will ever get back to normal
life. And believe me - this is not because of I/O: I have SSDs everywhere and
16 physical cores, and this machine barely uses 1-3 cores most of the time.
The problem is that allowing the STCS fallback is not a good option either -
it would quickly result in a few 200Gb+ sstables in my configuration, and
those sstables would never be compacted. Plus, it would require close to 2x
disk space on EVERY disk in my JBOD configuration... which would kill the
node sooner or later. This all happens because all sstables after bootstrap
end up at L0, and the process then slowly, slowly moves them to the other
levels. If you have write traffic to that CF, the number of sstables in L0
grows quickly - as is happening in my case now.

Once something like https://issues.apache.org/jira/browse/CASSANDRA-8301 is
implemented it may be better.


On Sun, Nov 23, 2014 at 4:53 AM, Andrei Ivanov aiva...@iponweb.net wrote:

 Stephane,

 We are having a somewhat similar C* load profile. Hence some comments
 in addition Nikolai's answer.
 1. Fallback to STCS - you can disable it actually
 2. Based on our experience, if you have a lot of data per node, LCS
 may work just fine. That is, till the moment you decide to join
 another node - chances are that the newly added node will not be able
 to compact what it gets from old nodes. In your case, if you switch
 strategy the same thing may happen. This is all due to limitations
 mentioned by Nikolai.

 Andrei,


 On Sun, Nov 23, 2014 at 8:51 AM, Servando Muñoz G. smg...@gmail.com
 wrote:
  ABUSE
 
 
 
  YA NO QUIERO MAS MAILS SOY DE MEXICO
 
 
 
  De: Nikolai Grigoriev [mailto:ngrigor...@gmail.com]
  Enviado el: sábado, 22 de noviembre de 2014 07:13 p. m.
  Para: user@cassandra.apache.org
  Asunto: Re: Compaction Strategy guidance
  Importancia: Alta
 
 
 
  Stephane,
 
  As everything good, LCS comes at certain price.
 
  LCS will put most load on you I/O system (if you use spindles - you may
 need
  to be careful about that) and on CPU. Also LCS (by default) may fall
 back to
  STCS if it is falling behind (which is very possible with heavy writing
  activity) and this will result in higher disk space usage. Also LCS has
  certain limitation I have discovered lately. Sometimes LCS may not be
 able
  to use all your node's resources (algorithm limitations) and this reduces
  the overall compaction throughput. This may happen if you have a large
  column family with lots of data per node. STCS won't have this
 limitation.
 
 
 
  By the way, the primary goal of LCS is to reduce the number of sstables
 C*
  has to look at to find your data. With LCS properly functioning this
 number
  will be most likely between something like 1 and 3 for most of the reads.
  But if you do few reads and not concerned about the latency today, most
  likely LCS may only save you some disk space.
 
 
 
  On Sat, Nov 22, 2014 at 6:25 PM, Stephane Legay sle...@looplogic.com
  wrote:
 
  Hi there,
 
 
 
  use case:
 
 
 
  - Heavy write app, few reads.
 
  - Lots of updates of rows / columns.
 
  - Current performance is fine, for both writes and reads..
 
  - Currently using SizedCompactionStrategy
 
 
 
  We're trying to limit the amount of storage used during compaction.
 Should
  we switch to LeveledCompactionStrategy?
 
 
 
  Thanks
 
 
 
 
  --
 
  Nikolai Grigoriev
  (514) 772-5178




-- 
Nikolai Grigoriev
(514) 772-5178


Re: Compaction Strategy guidance

2014-11-22 Thread Nikolai Grigoriev
Stephane,

As with everything good, LCS comes at a certain price.

LCS will put more load on your I/O system (if you use spindles, you may need
to be careful about that) and on the CPU. Also, LCS (by default) may fall
back to STCS if it is falling behind (which is very possible with heavy write
activity), and this will result in higher disk space usage. LCS also has a
certain limitation I have discovered lately: sometimes LCS may not be able to
use all of your node's resources (an algorithm limitation), and this reduces
the overall compaction throughput. This may happen if you have a large column
family with lots of data per node. STCS does not have this limitation.

By the way, the primary goal of LCS is to reduce the number of sstables C*
has to look at to find your data. With LCS functioning properly, this number
will most likely be between 1 and 3 for most reads. But if you do few reads
and are not concerned about latency today, LCS will most likely only save you
some disk space.
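
For reference, switching the table in the question below is a single schema
change; a minimal sketch through the Java driver (the contact point and table
name are placeholders, and 160 MB is just the default LCS sstable size, not a
recommendation):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SwitchToLcs {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        // Existing sstables are gradually re-compacted into levels after this change.
        session.execute("ALTER TABLE my_keyspace.my_table WITH compaction = "
                + "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}");
        cluster.close();
    }
}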

On Sat, Nov 22, 2014 at 6:25 PM, Stephane Legay sle...@looplogic.com
wrote:

 Hi there,

 use case:

 - Heavy write app, few reads.
 - Lots of updates of rows / columns.
 - Current performance is fine, for both writes and reads..
 - Currently using SizedCompactionStrategy

 We're trying to limit the amount of storage used during compaction. Should
 we switch to LeveledCompactionStrategy?

 Thanks




-- 
Nikolai Grigoriev
(514) 772-5178


Re: high context switches

2014-11-21 Thread Nikolai Grigoriev
How do the clients connect, which protocol do they use, and do they use
keep-alive connections? Is it possible that the clients use Thrift and the
server type (rpc_server_type) is sync? It is just a guess, but in that
scenario, with a high number of clients connecting and disconnecting often,
there could be a lot of context switching.
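
If the clients turn out to be on the native protocol with the Java driver,
the usual way to get keep-alive behaviour is to build a single Cluster/Session
for the whole application and reuse it, rather than connecting and
disconnecting around every request - a rough sketch (the contact point,
keyspace and query are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SharedSession {

    // One Cluster/Session per application; the driver keeps pooled, long-lived connections.
    private static final Cluster CLUSTER =
            Cluster.builder().addContactPoint("127.0.0.1").build();
    private static final Session SESSION = CLUSTER.connect();

    public static Row readOne(String id) {
        // No per-request connection setup/teardown.
        return SESSION.execute("SELECT * FROM my_keyspace.my_table WHERE id = ?", id).one();
    }
}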

On Fri, Nov 21, 2014 at 4:21 AM, Jan Karlsson jan.karls...@ericsson.com
wrote:

  Hello,



 We are running a 3 node cluster with RF=3 and 5 clients in a test
 environment. The C* settings are mostly default. We noticed quite high
 context switching during our tests. On 100 000 000 keys/partitions we
 averaged around 260 000 cs (with a max of 530 000).



 We were running 12 000~ transactions per second. 10 000 reads and 2000
 updates.



 Nothing really wrong with that however I would like to understand why
 these numbers are so high. Have others noticed this behavior? How much
 context switching is expected and why? What are the variables that affect
 this?



 /J




-- 
Nikolai Grigoriev
(514) 772-5178


coordinator selection in remote DC

2014-11-20 Thread Nikolai Grigoriev
Hi,

There is something odd I observed when testing a two-DC configuration for the
first time. I wanted to do a simple functional test to prove to myself (and to
my pessimistic colleagues ;) ) that it works.

I have a test cluster of 6 nodes, 3 in each DC, and a keyspace that is
replicated as follows:

CREATE KEYSPACE xxx WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC2': '3',
  'DC1': '3'
};


I have disabled the traffic compression between DCs to get more accurate
numbers.

I have set up a bunch of IP accounting rules on each node so that it counts
the outgoing traffic from that node to every other node. I had rules for
different ports but, of course, inter-node traffic is mostly about port 7000
(or 7001). Anyway, I have a table that shows the traffic from each node to
every other node's port 7000.

I ran a test with DCAwareRoundRobinPolicy and the client talking only to the
DC1 nodes. Everything looks fine - the client sent an identical amount of data
to each of the 3 nodes in DC1, and the nodes inside DC1 (I was writing with
LOCAL_ONE consistency) sent each other a similar amount of data, representing
exactly the two extra replicas.
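
For context, a client configured the way described above would look roughly
like this with the 2.x Java driver (a sketch only; the contact points reuse
the DC1 addresses from the table below, everything else is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class Dc1Client {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.3.45.156", "10.3.45.157", "10.3.45.158")  // DC1 nodes only
                .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1"))    // coordinators stay in DC1
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))       // ack locally, replicate to DC2 asynchronously
                .build();
        Session session = cluster.connect("xxx");   // the keyspace defined above
        // ... run the write workload ...
        cluster.close();
    }
}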

However, when I look at the traffic from the nodes in DC1 to the nodes in
DC2, the picture is different:

  source        destination   port       bytes sent
  10.3.45.156   10.3.45.159   dpt:7000   117,273,075
  10.3.45.156   10.3.45.160   dpt:7000   228,326,091
  10.3.45.156   10.3.45.161   dpt:7000    46,924,339
  10.3.45.157   10.3.45.159   dpt:7000   118,978,269
  10.3.45.157   10.3.45.160   dpt:7000   230,444,929
  10.3.45.157   10.3.45.161   dpt:7000    47,394,179
  10.3.45.158   10.3.45.159   dpt:7000   113,969,248
  10.3.45.158   10.3.45.160   dpt:7000   225,844,838
  10.3.45.158   10.3.45.161   dpt:7000    46,338,939

Nodes 10.3.45.156-158 are in DC1, .159-.161 are in DC2. As you can see, each
node in DC1 sent a very different amount of traffic to each of the remote
nodes: roughly 117Mb, 228Mb and 46Mb respectively. Both DCs have a single rack.

So, here is my question: how does a node select the node in the remote DC to
forward the message to? I did a quick sweep through the code and could only
find sorting by proximity (checking the rack and DC). Considering that for
each request I fire the targets are all 3 nodes in the remote DC, the list
will contain all 3 nodes in DC2 - and, if I understood correctly, the first
node in the list is picked to send the message to.

So it seems to me that no round-robin-type logic is applied when selecting,
from the list of targets in the remote DC, the node to forward the write to.

If this is true (and the numbers kind of show it is, right?), then perhaps the
list of equally-proximate nodes should be shuffled randomly? Or, instead of
always picking the first target, a random one should be picked?


-- 
Nikolai Grigoriev


Re: any way to get nodetool proxyhistograms data for an entire cluster?

2014-11-20 Thread Nikolai Grigoriev
OpsCenter can manage an existing cluster (even a vanilla Cassandra one). And
you are not required to use OpsCenter with DSE, of course.

As for the graphs... I do think that OpsCenter is not necessarily the ideal
graphing tool for Cassandra. If you want nice, detailed (and independent)
dashboards, you can feed Cassandra metrics into anything - Graphite, etc. I
use both OpsCenter and Graphite: there are things you get out of the box from
OpsCenter, and there are very nice things you can only do with Graphite :)

On Thu, Nov 20, 2014 at 11:10 AM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Nov 19, 2014 at 7:20 PM, Clint Kelly clint.ke...@gmail.com
 wrote:

 We have DSE so I can use opscenter.  I was just looking for something
 more precise than the graphs that I get from opscenter.


 To be clear, I don't believe DSE is required to use OpsCenter. My
 understanding is that it can be used with vanilla Apache Cassandra, but I
 have never actually tried to do so.

 =Rob





-- 
Nikolai Grigoriev
(514) 772-5178


Re: coordinator selection in remote DC

2014-11-20 Thread Nikolai Grigoriev
Hmmm...I am using:

endpoint_snitch: com.datastax.bdp.snitch.DseDelegateSnitch

which is using:

delegated_snitch: org.apache.cassandra.locator.PropertyFileSnitch

(for this specific test cluster)

I did not check the code - is the dynamic snitch on by default and, perhaps,
used as a wrapper around the configured endpoint_snitch?

That would certainly explain the difference in the inter-DC traffic. It also
would not affect the local-DC traffic, since all the local nodes are replicas
for the data anyway.


On Thu, Nov 20, 2014 at 12:03 PM, Tyler Hobbs ty...@datastax.com wrote:

 The difference is likely due to the DynamicEndpointSnitch (aka dynamic
 snitch), which picks replicas to send messages to based on recently
 observed latency and self-reported load (accounting for compactions,
 repair, etc).  If you want to confirm this, you can disable the dynamic
 snitch by adding this line to cassandra.yaml: dynamic_snitch: false.

 On Thu, Nov 20, 2014 at 9:52 AM, Nikolai Grigoriev ngrigor...@gmail.com
 wrote:

 Hi,

 There is something odd I have observed when testing a configuration with
 two DC for the first time. I wanted to do a simple functional test to prove
 myself (and my pessimistic colleagues ;) ) that it works.

 I have a test cluster of 6 nodes, 3 in each DC, and a keyspace that is
 replicated as follows:

  CREATE KEYSPACE xxx WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC2': '3',
    'DC1': '3'
  };


 I have disabled the traffic compression between DCs to get more accurate
 numbers.

 I have set up a bunch of IP accounting rules on each node so they count
 the outgoing traffic from this node to each other node. I had rules for
 different ports but, of course, but it is mostly about port 7000 (or 7001)
 when talking about inter-node traffic. Anyway, I have a table that shows
 the traffic from any node to any node's port 7000.

 I have ran a test with DCAwareRoundRobinPolicy and the client talking
 only to DC1 nodes. Everything looks fine - the client has sent identical
 amount of data to each of 3 nodes in DC1. These nodes inside of DC1 (I was
 writing with LOCAL_ONE consistency) have sent similar amount of data to
 each other that represents exactly two extra replicas.

 However, when I look at the traffic from the nodes in DC1 to the nodes in
 DC2, the picture is different:

   source        destination   port       bytes sent
   10.3.45.156   10.3.45.159   dpt:7000   117,273,075
   10.3.45.156   10.3.45.160   dpt:7000   228,326,091
   10.3.45.156   10.3.45.161   dpt:7000    46,924,339
   10.3.45.157   10.3.45.159   dpt:7000   118,978,269
   10.3.45.157   10.3.45.160   dpt:7000   230,444,929
   10.3.45.157   10.3.45.161   dpt:7000    47,394,179
   10.3.45.158   10.3.45.159   dpt:7000   113,969,248
   10.3.45.158   10.3.45.160   dpt:7000   225,844,838
   10.3.45.158   10.3.45.161   dpt:7000    46,338,939

 Nodes 10.3.45.156-158 are in DC1, .159-161 - in DC2. As you can see, each
 of nodes in DC1 has sent different amount of traffic to the remote nodes:
 117Mb, 228Mb and 46Mb respectively. Both DC have one rack.

 So, here is my question. How does node select the node in remote DC to
 send the message to? I did a quick sweep through the code and I could only
 find the sorting by proximity (checking the rack and DC). So, considering
 that for each request I fire the targets are all 3 nodes in the remote DC,
 the list will contain all 3 nodes in DC2. And, if I understood correctly,
 the first node from the list is picked to send the message.

 So, it seems to me that there is no any kind of round-robin-type logic is
 applied when selecting the target node to forward the write to from the
 list of targets in remote DC.

 If this is true (and the numbers kind of show it is, right?), then
 probably the list with equal proximity should be shuffled randomly? Or,
 instead of picking the first target, a random one should be picked?


 --
 Nikolai Grigoriev




 --
 Tyler Hobbs
 DataStax http://datastax.com/




-- 
Nikolai Grigoriev
(514) 772-5178