Re: Switching to Incremental Repair

2024-02-15 Thread Chris Lohfink
I would recommend adding something to C* to be able to flip the repaired
state on all sstables quickly (with default OSS you can turn nodes off one at a
time and use sstablerepairedset). It's a life saver to be able to revert
back to non-IR if the migration is going south. The same can be used to quickly
switch into IR sstables, with more caveats. Probably worth a jira to add a faster

On Thu, Feb 15, 2024 at 12:50 PM Kristijonas Zalys  wrote:

> Hi folks,
> One last question regarding incremental repair.
> What would be a safe approach to temporarily stop running incremental
> repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
> understanding is that if we simply stop running incremental repair, the
> cluster's nodes can, in the worst case, double in disk size as the repaired
> dataset will not get compacted with the unrepaired dataset. Similar to
> Sebastian, we have nodes where the disk usage is multiple TiBs so
> significant growth can be quite dangerous in our case. Would the only safe
> choice be to mark all SSTables as unrepaired before stopping regular
> incremental repair?
> Thanks,
> Kristijonas
> On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user wrote:
>> The over-streaming is only problematic for the repaired SSTables, but it
>> can be triggered by inconsistencies within the unrepaired SSTables
>> during an incremental repair session. This is because, although an
>> incremental repair will only compare the unrepaired SSTables, it
>> will stream both the unrepaired and repaired SSTables for the
>> inconsistent token ranges. Keep in mind that the source SSTables for
>> streaming are selected based on the token ranges, not the
>> repaired/unrepaired state.
>> Based on the above, I'm unsure running an incremental repair before a
>> full repair can fully avoid the over-streaming issue.
>> On 07/02/2024 22:41, Sebastian Marsching wrote:
>> > Thank you very much for your explanation.
>> >
>> > Streaming happens on the token range level, not the SSTable level,
>> right? So, when running an incremental repair before the full repair, the
>> problem that “some unrepaired SSTables are being marked as repaired on one
>> node but not on another” should not exist any longer. Now this data should
>> be marked as repaired on all nodes.
>> >
>> > Thus, when repairing the SSTables that are marked as repaired, this
>> data should be included on all nodes when calculating the Merkle trees and
>> no overstreaming should happen.
>> >
>> > Of course, this means that running an incremental repair *first* after
>> marking SSTables as repaired and only running the full repair *after* that
>> is critical. I have to admit that previously I wasn’t fully aware of how
>> critical this step is.
>> >
>> >> On 07.02.2024 at 20:22, Bowen Song via user wrote:
>> >>
>> >> Unfortunately repair doesn't compare each partition individually.
>> Instead, it groups multiple partitions together and calculates a hash of
>> them, stores the hash in a leaf of a merkle tree, and then compares the
>> merkle trees between replicas during a repair session. If any one of the
>> partitions covered by a leaf is inconsistent between replicas, the hash
>> values in these leaves will be different, and all partitions covered by the
>> same leaf will need to be streamed in full.
>> >>
>> >> Knowing that, and also knowing that your approach can create a lot of
>> inconsistencies in the repaired SSTables because some unrepaired SSTables
>> are being marked as repaired on one node but not on another, you would then
>> understand why over-streaming can happen. The over-streaming is only
>> problematic for the repaired SSTables, because they are much bigger than
>> the unrepaired.
>> >>
>> >>
>> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>  Caution, using the method you described, the amount of data streamed
>> at the end with the full repair is not the amount of data written between
>> stopping the first node and the last node, but depends on the table size,
>> the number of partitions written, their distribution in the ring and the
>> 'repair_session_space' value. If the table is large, the writes touch a
>> large number of partitions scattered across the token ring, and the value
>> of 'repair_session_space' is small, you may end up with a very expensive
>> over-streaming.
>> >>> Thanks for the warning. In our case it worked well (obviously we
>> tested it on a test cluster before applying it on the production clusters),
>> but it is good to know that this might not always be the case.
>> >>>
>> >>> Maybe I misunderstand how full and incremental repairs work in C*
>> 4.x. I would appreciate if you could clarify this for me.
>> >>>
>> >>> So far, I assumed that a full repair on a cluster that is also using
>> incremental repair pretty much works like on a cluster that is not using
>> incremental repair at all, the only difference being that the set 

Re: Nodetool command to pre-load the chunk cache

2023-03-24 Thread Chris Lohfink
Something additional to consider (outside of a C* fix) is using a tool like
happycache to keep the page cache consistent between them. That might be
sufficient if the data is in memory already.


On Tue, Mar 21, 2023 at 2:48 PM Jeff Jirsa  wrote:

> We serialize the other caches to disk to avoid cold-start problems, I
> don't see why we couldn't also serialize the chunk cache? Seems worth a
> JIRA to me.
> Until then, you can probably use the dynamic snitch (badness + severity)
> to route around newly started hosts.
> I'm actually pretty surprised the chunk cache is that effective, sort of
> nice to know.
> On Tue, Mar 21, 2023 at 10:17 AM Carlos Diaz  wrote:
>> Hi Team,
>> We are heavy users of Cassandra at a pretty big bank.  Security measures
>> require us to constantly refresh our C* nodes every x number of days.  We
>> normally do this in a rolling fashion, taking one node down at a time and
>> then refreshing it with a new instance.  This process has been working for
>> us great for the past few years.
>> However, we recently started having issues when a newly refreshed
>> instance comes back online, our automation waits a few minutes for the node
>> to become "ready (UN)" and then moves on to the next node.  The problem
>> that we are facing is that when the node is ready, the chunk cache is still
>> empty so when the node starts accepting new connections, queries that go to
>> it take much longer to respond and this causes errors for our apps.
>> I was thinking that it would be great if we had a nodetool command that
>> would allow us to prefetch a certain table or a set of tables to preload
>> the chunk cache.  Then we could simply add another check (nodetool info?),
>> to ensure that the chunk cache has been preloaded enough to handle queries
>> to this particular node.
>> Would love to hear others' feedback on the feasibility of this idea.
>> Thanks!

[Request] End user comments for Apache Cassandra 4.1

2022-07-27 Thread Chris Thornett
Hey everyone,

We're pulling together comments from end users on the release of Apache
Cassandra 4.1 (coming soon).

If you would like to contribute, please email them to me on *chris at
constantia dot io*.

Essentially, we're looking for positive quotes on your decision to use
Cassandra and you're welcome to discuss what features you're looking
forward to in the new release.

Thanks in advance for your help! Promoting how Cassandra is being used
worldwide encourages more people to look at the project and brings in new
supporters and contributors, so it makes a real difference and is
greatly appreciated!



Chris Thornett
senior content strategist,

[Marketing] Share your ApacheCon experiences

2022-05-12 Thread Chris Thornett
Have you attended ApacheCon before? Maybe you've presented in the past?
Please could you add some comments to this short form:

We're looking to use quotes from the community to promote participation at
future events and encourage paper submissions. Thanks!


Chris Thornett
senior content strategist,

Interview opportunity for Cassandra users

2022-01-10 Thread Chris Thornett
Hello everyone,

My name is Chris, and I provide content support for the Apache Cassandra
project. As part of Apache Cassandra's ongoing content marketing, I'd like
to give Cassandra users the opportunity to participate in a little
interview series called 'Inside Cassandra', which will be published on the
Apache Cassandra blog. Here's an example of something we've done

This is an opportunity for you to talk about how you're using Cassandra and
perhaps talk about trends you're seeing in your field and anything topical
around the Cassandra ecosystem (as long as it's open source and not used as
a promo opportunity for a commercial product, please, as we are an Apache

By sharing your experiences, you'll help other users learn more about
Cassandra, share your knowledge, and spread the word
about the database, which everyone in the community would greatly

The interviews themselves would only take half an hour. We'd record them
and then write them up as blog entries and/or potentially offer them to
other press outlets (if you'd be interested in permitting us to do that as

*Case Studies*
Additionally, if you think your company would be happy to be interviewed
for a case study, we're always looking for those to write up and add to the
dedicated case study page here -

If you're interested in either option, please let me know at


Chris Thornett
senior content strategist,

Re: Performance drop of current Java drivers

2020-05-01 Thread Chris Splinter
Hi Matthias,

I have forwarded this to the developers that work on the Java driver and
they will be looking into this first thing next week.

Will circle back here with findings,


On Fri, May 1, 2020 at 12:28 AM Erick Ramirez 

> Matthias, I don't have an answer to your question but I just wanted to
> note that I don't believe the driver contributors actively watch this
> mailing list (I'm happy to be corrected) so I'd recommend you
> cross-post in the Java driver channels as well. Cheers!

Re: COPY command with where condition

2020-01-17 Thread Chris Splinter
Do you know your partition keys?

One option could be to enumerate that list of partition keys in separate
commands to make the individual operations less expensive for the cluster.

For example:
Say your partition key column is called id and the ids in your database are

You could do
./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
FROM probe_sensors WHERE id = 1 AND localisation_id = 208812" -url
./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
FROM probe_sensors WHERE id = 2 AND localisation_id = 208812" -url
./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
FROM probe_sensors WHERE id = 3 AND localisation_id = 208812" -url

Does that option work for you?
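
The per-key commands above could be generated rather than typed by hand. Here is a minimal Python sketch, assuming integer partition keys and one output directory per key. The helper name and the `id_<n>` directory naming are illustrative assumptions, not dsbulk conventions; the `/home/dump` base path used in the note below is taken from the earlier message in this thread.

```python
def build_unload_commands(partition_ids, keyspace, table, filter_clause, out_dir):
    """Return one dsbulk unload command (as an argv list) per partition key.

    The option names mirror the commands shown in this thread; the per-key
    output directory (out_dir/id_<n>) is an assumption for illustration.
    """
    commands = []
    for pid in partition_ids:
        key = int(pid)  # reject anything that is not an integer key
        query = f"SELECT * FROM {table} WHERE id = {key} AND {filter_clause}"
        commands.append([
            "./dsbulk", "unload",
            "--dsbulk.schema.keyspace", keyspace,
            "-query", query,
            "-url", f"{out_dir}/id_{key}",
        ])
    return commands
```

Calling `build_unload_commands([1, 2, 3], "dev_keyspace", "probe_sensors", "localisation_id = 208812", "/home/dump")` yields three argv lists that can be passed to `subprocess.run` one at a time, so each query only touches a single partition.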

On Fri, Jan 17, 2020 at 12:17 PM adrien ruffie 

> I don't really know yet for the production environment, but in the
> development environment the table contains more than 10,000,000 rows.
> But we only need a subset of this table, not all of it ...
> ------
> *From:* Chris Splinter
> *Sent:* Friday, 17 January 2020 17:40
> *To:* adrien ruffie
> *Cc:* Erick Ramirez
> *Subject:* Re: COPY command with where condition
> What you are seeing there is a standard read timeout, how many rows do you
> expect back from that query?
> On Fri, Jan 17, 2020 at 9:50 AM adrien ruffie 
> wrote:
> Thank you very much,
>  so I do this request with for example -->
> ./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
> FROM probe_sensors WHERE localisation_id = 208812 ALLOW FILTERING" -url
> /home/dump
> But I get the following error
> com.datastax.dsbulk.executor.api.exception.BulkExecutionException:
> Statement execution failed: SELECT * FROM crt_sensors WHERE site_id =
> 208812 ALLOW FILTERING (Cassandra timeout during read query at consistency
> LOCAL_ONE (1 responses were required but only 0 replica responded))
> I configured my driver with the following driver.conf, but nothing works
> correctly. Do you know what the problem is?
> datastax-java-driver {
>   basic {
>     contact-points = ["data1com:9042",""]
>     request {
>       timeout = "200"
>       consistency = "LOCAL_ONE"
>     }
>   }
>   advanced {
>     auth-provider {
>       class = PlainTextAuthProvider
>       username = "superuser"
>       password = "mypass"
>     }
>   }
> }
> --
> *From:* Chris Splinter
> *Sent:* Friday, 17 January 2020 16:17
> *To:*
> *Cc:* Erick Ramirez
> *Subject:* Re: COPY command with where condition
> DSBulk has an option that lets you specify the query ( including a WHERE
> clause )
> See Example 19 in this blog post for details:
> On Fri, Jan 17, 2020 at 7:34 AM Jean Tremblay wrote:
> Did you think about using a Materialised View to generate what you want to
> keep, and then use DSBulk to extract the data?
> On 17 Jan 2020, at 14:30 , adrien ruffie 
> wrote:
> Sorry, I'm coming back with a quick question about the bulk loader ...
> I read this: "Operations such as converting strings to lowercase,
> arithmetic on input columns, or filtering out rows based on some criteria,
> are not supported."
> Consequently, it's still not possible to use a WHERE clause with DSBulk,
> right?
> I don't really know how to do it, so that we avoid exporting all of the
> business data already stored, which doesn't need to be exported...
> --
> *From:* adrien ruffie
> *Sent:* Friday, 17 January 2020 11:39
> *To:* Erick Ramirez
> *Subject:* RE: COPY command with where condition
> Thanks a lot!
> That's good news about DSBulk! I will look into this solution.
> Best regards,
> Adrian
> --
> *From:* Erick Ramirez
> *Sent:* Friday, 17 January 2020 10:02
> *To:*
> *Subject:* Re: COPY command with where condition
> The COPY command doesn't support filtering and it doesn't perform well for
> large tables.
> Have you considered the DSBulk tool from DataStax? Previously, it only
> worked with Dat

Re: COPY command with where condition

2020-01-17 Thread Chris Splinter
What you are seeing there is a standard read timeout. How many rows do you
expect back from that query?

On Fri, Jan 17, 2020 at 9:50 AM adrien ruffie 

> Thank you very much,
>  so I do this request with for example -->
> ./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
> FROM probe_sensors WHERE localisation_id = 208812 ALLOW FILTERING" -url
> /home/dump
> But I get the following error
> com.datastax.dsbulk.executor.api.exception.BulkExecutionException:
> Statement execution failed: SELECT * FROM crt_sensors WHERE site_id =
> 208812 ALLOW FILTERING (Cassandra timeout during read query at consistency
> LOCAL_ONE (1 responses were required but only 0 replica responded))
> I configured my driver with the following driver.conf, but nothing works
> correctly. Do you know what the problem is?
> datastax-java-driver {
>   basic {
>     contact-points = ["data1com:9042",""]
>     request {
>       timeout = "200"
>       consistency = "LOCAL_ONE"
>     }
>   }
>   advanced {
>     auth-provider {
>       class = PlainTextAuthProvider
>       username = "superuser"
>       password = "mypass"
>     }
>   }
> }
> --
> *From:* Chris Splinter
> *Sent:* Friday, 17 January 2020 16:17
> *To:*
> *Cc:* Erick Ramirez
> *Subject:* Re: COPY command with where condition
> DSBulk has an option that lets you specify the query ( including a WHERE
> clause )
> See Example 19 in this blog post for details:
> On Fri, Jan 17, 2020 at 7:34 AM Jean Tremblay wrote:
> Did you think about using a Materialised View to generate what you want to
> keep, and then use DSBulk to extract the data?
> On 17 Jan 2020, at 14:30 , adrien ruffie 
> wrote:
> Sorry, I'm coming back with a quick question about the bulk loader ...
> I read this: "Operations such as converting strings to lowercase,
> arithmetic on input columns, or filtering out rows based on some criteria,
> are not supported."
> Consequently, it's still not possible to use a WHERE clause with DSBulk,
> right?
> I don't really know how to do it, so that we avoid exporting all of the
> business data already stored, which doesn't need to be exported...
> --
> *From:* adrien ruffie
> *Sent:* Friday, 17 January 2020 11:39
> *To:* Erick Ramirez
> *Subject:* RE: COPY command with where condition
> Thanks a lot!
> That's good news about DSBulk! I will look into this solution.
> Best regards,
> Adrian
> --
> *From:* Erick Ramirez
> *Sent:* Friday, 17 January 2020 10:02
> *To:*
> *Subject:* Re: COPY command with where condition
> The COPY command doesn't support filtering and it doesn't perform well for
> large tables.
> Have you considered the DSBulk tool from DataStax? Previously, it only
> worked with DataStax Enterprise but a few weeks ago, it was made free and
> works with open-source Apache Cassandra. For details, see this blogpost.
> Cheers!
> On Fri, Jan 17, 2020 at 6:57 PM adrien ruffie 
> wrote:
> Hello all,
> At my company we want to export a large dataset from our Cassandra ring.
> We are looking at the COPY command, but I can't find whether, and how, a
> WHERE condition can be used.
> We need to export only some of the data, which must be selected by a
> WHERE clause, and unfortunately with ALLOW FILTERING, because of several
> old tables which were poorly designed...
> Do you know a way to do that, please?
> Thanks all and best regards,
> Adrian

Re: COPY command with where condition

2020-01-17 Thread Chris Splinter
DSBulk has an option that lets you specify the query ( including a WHERE
clause )

See Example 19 in this blog post for details:

On Fri, Jan 17, 2020 at 7:34 AM Jean Tremblay wrote:

> Did you think about using a Materialised View to generate what you want to
> keep, and then use DSBulk to extract the data?
> On 17 Jan 2020, at 14:30 , adrien ruffie 
> wrote:
> Sorry, I'm coming back with a quick question about the bulk loader ...
> I read this: "Operations such as converting strings to lowercase,
> arithmetic on input columns, or filtering out rows based on some criteria,
> are not supported."
> Consequently, it's still not possible to use a WHERE clause with DSBulk,
> right?
> I don't really know how to do it, so that we avoid exporting all of the
> business data already stored, which doesn't need to be exported...
> --
> *From:* adrien ruffie
> *Sent:* Friday, 17 January 2020 11:39
> *To:* Erick Ramirez
> *Subject:* RE: COPY command with where condition
> Thanks a lot!
> That's good news about DSBulk! I will look into this solution.
> Best regards,
> Adrian
> --
> *From:* Erick Ramirez
> *Sent:* Friday, 17 January 2020 10:02
> *To:*
> *Subject:* Re: COPY command with where condition
> The COPY command doesn't support filtering and it doesn't perform well for
> large tables.
> Have you considered the DSBulk tool from DataStax? Previously, it only
> worked with DataStax Enterprise but a few weeks ago, it was made free and
> works with open-source Apache Cassandra. For details, see this blogpost.
> Cheers!
> On Fri, Jan 17, 2020 at 6:57 PM adrien ruffie 
> wrote:
> Hello all,
> At my company we want to export a large dataset from our Cassandra ring.
> We are looking at the COPY command, but I can't find whether, and how, a
> WHERE condition can be used.
> We need to export only some of the data, which must be selected by a
> WHERE clause, and unfortunately with ALLOW FILTERING, because of several
> old tables which were poorly designed...
> Do you know a way to do that, please?
> Thanks all and best regards,
> Adrian

Unified DataStax drivers

2020-01-16 Thread Chris Splinter
Hi all,

Last September, Jonathan Ellis announced at ApacheCon NA that DataStax was
going to unify the drivers that we develop for Apache Cassandra and DataStax
Enterprise into a single open-source, Apache v2.0 Licensed driver. Yesterday,
we released this new version of the drivers across our C++, C#, Java, Node.js
and Python drivers. See the blog post for links to the source code and
documentation.

With this unified driver, we are committing to developing all of our new
functionality in this single driver going forward, available for all
Cassandra users and not just DataStax customers. This means that the
following are now available for all users:

Java: Spring Boot Starter

This starter is currently available in DataStax Labs; our goal is to get it
into the Spring Boot project. Also of note, Mark Paluch and the team that
works on Spring Data Cassandra recently completed their upgrade to the 4.x
line of the Java Driver ( DATACASS-656 ).

Java: Built-in support for Reactive programming

This new version of the Java Driver ( v4.4.0 ) now has an executeReactive
method on CqlSession for those working with Reactive Streams. See the
for details.

Java, Node.js: New Load Balancing Policy

The Java and Node.js drivers now have a new load balancing policy that uses
the in-flight request count for each node to drive the Power of 2 Choices
decision and takes into account the dequeuing rate of the in-flight requests to
avoid slow nodes. In addition, the amount of time that a node has been UP
is also considered when creating the query plan to only send requests to
nodes when they are ready. We are also working to get this into the C++, C#
and Python drivers soon.
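
To illustrate the Power of 2 Choices idea in isolation, here is a minimal Python sketch. It is not the driver's implementation: it only looks at in-flight request counts and ignores the dequeuing rate and node uptime mentioned above. The function name and the shape of the `in_flight` mapping are assumptions made for the example.

```python
import random


def pick_node(in_flight, rng=random):
    """Power of two choices: sample two distinct nodes, keep the less loaded one.

    in_flight maps a node name to its current in-flight request count.
    """
    nodes = list(in_flight)
    if not nodes:
        raise ValueError("no nodes available")
    if len(nodes) == 1:
        return nodes[0]
    first, second = rng.sample(nodes, 2)
    return first if in_flight[first] <= in_flight[second] else second
```

Because only two nodes are compared per request, the choice stays cheap, yet the most loaded node in the cluster is never selected: it always loses the comparison against whichever node it is paired with.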

Python: Pre-Built Wheels

Previously we only had pre-built wheels for the DSE driver but now they are
available for everyone to use in this new version of the driver ( v3.21.0

Along with the bulk loader and Kafka connector that we made available for use
with Apache Cassandra in December last year, we hope that this helps simplify
the picture for those that use our drivers.



Re: Replication system_distributed

2020-01-10 Thread Chris Splinter
Hi Marcel,

The RF for that keyspace is currently hardcoded, see CASSANDRA-11098. I am
not sure why your RF switched from 1 to 3 after you restarted the cluster; I
tried the same and it remained at 1 for me.

The tables in that keyspace are used to store history about the repair
operations; having it at RF=3 shouldn't affect the performance of the
repair. See CASSANDRA-5839 for when / why it was introduced.

Changing the replication from 3 to 1 for system_distributed is not a good
idea for the same reasons why changing the replication of *any* keyspace to
1 is not a good idea. You lose the ability to query that data if a single
node goes down.

Hope this helps,


On Wed, Jan 8, 2020 at 1:23 AM Marcel Jakobi  wrote:

> Hi,
> the default definition of the keyspace system_distributed is:
> CREATE KEYSPACE system_distributed WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '3'}  AND durable_writes = true;
> If I understand correctly, all repair information will be replicated on
> three servers in the cluster. I have changed the RF to '1'. Once I stop the
> entire cluster, the replication factor changes back to 3. It seems
> Cassandra wants it to be 3.
> Doesn't that reduce the performance of repair operations?
> Why is the RF changed again after restarting the cluster?
> Are there reasons why you shouldn't change the replication factor to 1?
> Thanks,
> Marcel

Re: oversized partition detection ? monitoring the partitions growth ?

2019-11-01 Thread Chris Lohfink
You can set compaction_large_partition_warning_threshold_mb and alert on

Writing large partition {}/{}:{} ({}) to sstable {}
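
For alerting, that message can be picked out of the logs with a small parser. The Python sketch below builds a regular expression from the message template above. How the size placeholder is rendered (for example `120.5MiB`) varies by version and is an assumption here, so the size is captured as opaque text; the function name is also just illustrative.

```python
import re

# Mirrors the template "Writing large partition {}/{}:{} ({}) to sstable {}".
LARGE_PARTITION = re.compile(
    r"Writing large partition (?P<keyspace>[^/\s]+)/(?P<table>[^:\s]+):(?P<key>.*) "
    r"\((?P<size>[^()]*)\) to sstable (?P<sstable>\S+)"
)


def parse_large_partition(line):
    """Return the fields of a large-partition warning, or None if the line differs."""
    match = LARGE_PARTITION.search(line)
    return match.groupdict() if match else None
```

A monitoring job could feed each new line of the system log through `parse_large_partition` and raise an alert whenever the result is not `None`.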


On Thu, Oct 31, 2019 at 8:01 AM Eric LELEU  wrote:

> Hi,
> I'm not sure that you are able to log which partition has reached 100MB
> but you may monitor the "EstimatedPartitionSizeHistogram" and take the
> max value (or the 99th or 95th percentile) to trigger an alert using your
> monitoring system.
> regards,
> Eric
> On 31/10/2019 at 12:37, wrote:
> Hi,
> How can I detect a partition that reaches 100MB? Is it possible to
> log the size of every partition once per day?
> regards,
> Nicolas Jäger

Re: GC Tuning

2019-10-19 Thread Chris Lohfink
"It depends" on your version and heap size but G1 is easier to get right so
probably want to stick with that unless you are using small heaps or are really
interested in tuning it (likely for massively smaller gains than tuning
your data model). There is no GC algo that is strictly better than others
in all scenarios unfortunately. If your JVM supports it, ZGC or Shenandoah
are likely going to give you the best latencies.


On Fri, Oct 18, 2019 at 8:41 PM Sergio Bilello 

> Hello!
> Is it still better to use ParNew + CMS than G1GC these
> days?
> Any recommendation for i3.xlarge nodes read-heavy workload?
> Thanks,
> Sergio
> -
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Re: loosing data during saving data from java

2019-10-19 Thread Chris Lohfink
If the writes are coming in fast enough that the commitlog can't keep up,
it will block applying mutations to the memtable (even with periodic, once
it hits >1.5x flush time). Things will queue up and possibly time out, but they
will not be acknowledged until applied. If you do it enough, fast enough, you
can dump a lot into the mutation queue and cause the node to OOM or
GC thrash, but it won't acknowledge the writes so you won't lose the data.

If you are firing off async writes, not waiting for acknowledgement, and
assuming they succeeded, you may lose data if C* did not succeed (which you
will be notified of via a WriteFailure, WriteTimeout, or an
OperationTimeout). A simple write like that can be idempotent so you can
just try again on failure.
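
As a rough illustration of "wait for the acknowledgement and retry on failure", here is a small Python sketch. It does not use a real driver: `write` is any callable that raises on failure, and `WriteFailed` is a hypothetical stand-in for whatever failure or timeout exception your driver reports.

```python
import time


class WriteFailed(Exception):
    """Hypothetical stand-in for a driver-side write failure or timeout."""


def write_with_retry(write, row, attempts=3, backoff_s=0.0):
    """Call write(row) until it is acknowledged or the attempts run out.

    Only safe for idempotent writes, because a write that timed out may
    still have been applied on the server. Returns the number of tries used.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            write(row)  # returns normally only once the write is acknowledged
            return attempt + 1
        except WriteFailed as exc:
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    raise last_error
```

The point of the sketch is that a write only counts as saved once the call returns without an error; anything else is retried or surfaced to the caller instead of being silently dropped.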


On Sat, Oct 19, 2019 at 1:26 AM adrien ruffie 

> Thanks Jeff
> but what if you save a lot of data too fast with the Cassandra repository
> and Cassandra cannot keep up and inserts more slowly?
> What is the behavior? Does Cassandra store the overflow in an additional
> buffer? Can no data be lost on Cassandra's side?
> Thanks a lot.
> Adrian
> --
> *From:* Jeff Jirsa
> *Sent:* Saturday, 19 October 2019 00:41
> *To:* cassandra
> *Subject:* Re: loosing data during saving data from java
> There is no buffer in cassandra that is known to (or suspected to)
> lose acknowledged writes if it's overwhelmed.
> There may be a client bug where you send so many async writes that they
> overwhelm a bounded queue, or otherwise get dropped or timeout, but those
> would be client bugs, and I'm not sure this list can help you with them.
> On Fri, Oct 18, 2019 at 3:16 PM adrien ruffie 
> wrote:
> Hello all,
> I have a Cassandra table into which I quickly insert Java entities,
> about 15,000 entries per minute. But at the end of the process, I only
> have, for example, 199,921 entries instead of 312,212.
> If I truncate the table and relaunch the process several times, I get
> 199,354
> or 189,012 entries ... the number of entries saved is never the same ...
> Several coworkers told me they heard about a buffer which can be
> overwhelmed
> sometimes, losing several entities queued for insertion ...
> Is that right?
> Because I don't understand why these insertions are lost ...
> And my Java code is very simple, like below:
> myEntitiesList.forEach(myEntity -> {
>   try {
> } catch (Exception e) {
> e.printStackTrace();
> }
> }
> And the repository is a:
> public interface MyEntityRepository extends ReactiveCassandraRepository<MyEntity, String> {
> }
> Has anyone already heard about this problem?
> Thank you very much and best regards
> Adrian

Re: Collecting Latency Metrics

2019-05-30 Thread Chris Lohfink
For what it is worth, generally I would recommend just using the mean vs
calculating it yourself. It's a lot easier, and averages are meaningless for
anything besides trending anyway (which is really what this is useful for,
finding issues on the larger scale), especially with high volume clusters,
so the loss in accuracy is kind of moot. Your average for local reads/writes
will almost always be sub-millisecond but you might end up having 500
millisecond requests or worse that the mean will hide.
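
A quick worked example of how a mean hides a slow request, as a Python sketch with made-up numbers (999 requests at 0.5 ms and a single request at 500 ms; the function name is illustrative):

```python
import statistics


def summarize(latencies_ms):
    """Return the mean and the maximum of a list of latencies in milliseconds."""
    return {"mean": statistics.mean(latencies_ms), "max": max(latencies_ms)}


# Made-up sample: 999 fast requests and one very slow one.
sample = [0.5] * 999 + [500.0]
summary = summarize(sample)
```

Here the mean works out to (999 * 0.5 + 500) / 1000 = 0.9995 ms, which still looks sub-millisecond, while the maximum is 500 ms. A trend graph of the mean would barely move.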


On Thu, May 30, 2019 at 6:30 AM shalom sagges 

> Thanks for your replies guys. I really appreciate it.
> @Alain, I use Graphite for backend on top of Grafana. But the goal is to
> move from Graphite to Prometheus eventually.
> I tried to find a direct way of getting a specific Latency metric in
> average and as Chris pointed out, the Mean value isn't that accurate.
> I do not wish to use the percentile metrics either, but a single latency
> metric like the *"Local read latency" *output in nodetool tablestats.
> Looking at the code of nodetool tablestats, it seems that C* also divides
> *ReadTotalLatency.Count* with *ReadLatency.Count *to get the latency
> result.
> So I guess I will have no choice but to run the calculation on my own via
> Graphite:
> divideSeries(averageSeries(keepLastValue(nonNegativeDerivative($$host.org_apache_cassandra_metrics.Table.$ks.$cf.ReadTotalLatency.Count))),averageSeries(keepLastValue(nonNegativeDerivative($$host.org_apache_cassandra_metrics.Table.$ks.$cf.ReadLatency.Count))))
> Does this seem right to you?
> Thanks!
> On Thu, May 30, 2019 at 12:34 AM Paul Chandler  wrote:
>> There are various attributes under
>> org.apache.cassandra.metrics.ClientRequest.Latency.Read these measure the
>> latency in milliseconds
>> Thanks
>> Paul
>> > On 29 May 2019, at 15:31, shalom sagges  wrote:
>> >
>> > Hi All,
>> >
>> > I'm creating a dashboard that should collect read/write latency metrics
>> on C* 3.x.
>> > In older versions (e.g. 2.0) I used to divide the total read latency in
>> microseconds with the read count.
>> >
>> > Is there a metric attribute that shows read/write latency without the
>> need to do the math, such as in nodetool tablestats "Local read latency"
>> output?
>> > I saw there's a Mean attribute in
>> org.apache.cassandra.metrics.ReadLatency but I'm not sure this is the right
>> one.
>> >
>> > I'd really appreciate your help on this one.
>> > Thanks!
>> >
>> >

Re: Collecting Latency Metrics

2019-05-30 Thread Chris Lohfink
> org.apache.cassandra.metrics.ClientRequest.Latency.Read these measure the
> latency in milliseconds

It's actually in microseconds, unless you call the values() operation, which
gives the histogram in nanoseconds.

On Wed, May 29, 2019 at 4:34 PM Paul Chandler  wrote:

> There are various attributes under
> org.apache.cassandra.metrics.ClientRequest.Latency.Read these measure the
> latency in milliseconds
> Thanks
> Paul
> > On 29 May 2019, at 15:31, shalom sagges  wrote:
> >
> > Hi All,
> >
> > I'm creating a dashboard that should collect read/write latency metrics
> on C* 3.x.
> > In older versions (e.g. 2.0) I used to divide the total read latency in
> microseconds with the read count.
> >
> > Is there a metric attribute that shows read/write latency without the
> need to do the math, such as in nodetool tablestats "Local read latency"
> output?
> > I saw there's a Mean attribute in
> org.apache.cassandra.metrics.ReadLatency but I'm not sure this is the right
> one.
> >
> > I'd really appreciate your help on this one.
> > Thanks!
> >
> >

Re: Collecting Latency Metrics

2019-05-29 Thread Chris Lohfink
To answer your question:
org.apache.cassandra.metrics:type=Table,name=ReadTotalLatency can give you
the total local read latency in microseconds, and you can get the count from
the read Latency metric.

If you are going to do that, be sure to do it on the delta from the previous
query (new - last) for both the total latency and the count, or else you will
slowly converge to a global average that will almost never change, as the
sheer quantity of reads washes out outliers. The Mean attribute of the
Latency metric you mentioned will actually give you an approximation of this,
as it takes the total/count of a decaying histogram of the
latencies. It will, however, be even less accurate than using the deltas,
since the bounds of the decay won't necessarily match up with your
reading intervals and the histogram introduces a worst-case 20% round-up. Even
with deltas, though, this will hide outliers: you could end up with
really bad queries that don't even show up as a tick on your graph
(although *generally* it will).
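
The delta approach can be sketched roughly as below. This is only an
illustration: the function name and the sample numbers are made up, and it
assumes you already poll the cumulative total latency (microseconds) and the
cumulative read count yourself.

```python
def interval_mean_latency_us(prev, curr):
    """Mean read latency (microseconds) for the interval between two polls.

    Each sample is a (total_latency_us, read_count) pair of cumulative values.
    Returns None when no reads happened in the interval or when a counter
    went backwards (for example after a node restart).
    """
    delta_total = curr[0] - prev[0]
    delta_count = curr[1] - prev[1]
    if delta_count <= 0 or delta_total < 0:
        return None
    return delta_total / delta_count


# Hypothetical cumulative samples from three consecutive polls.
samples = [(1_000_000, 1_000), (1_500_000, 2_000), (9_500_000, 3_000)]

for prev, curr in zip(samples, samples[1:]):
    print(interval_mean_latency_us(prev, curr))
```

With these made-up numbers the second interval averages 8000 microseconds,
while total/count over the whole lifetime is only about 3167, which is the
flattening effect described above.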


On Wed, May 29, 2019 at 9:32 AM shalom sagges wrote:

> Hi All,
> I'm creating a dashboard that should collect read/write latency metrics on
> C* 3.x.
> In older versions (e.g. 2.0) I used to divide the total read latency in
> microseconds with the read count.
> Is there a metric attribute that shows read/write latency without the need
> to do the math, such as in nodetool tablestats "Local read latency" output?
> I saw there's a Mean attribute in org.apache.cassandra.metrics.ReadLatency
> but I'm not sure this is the right one.
> I'd really appreciate your help on this one.
> Thanks!

Re: GraalVM

2019-05-09 Thread Chris Hane
Awesome.  Will try to join.

Thanks for the links.  Will look through them also.

On Thu, May 9, 2019 at 8:33 AM Sebastian Estevez <> wrote:

> Hi Chris,
> Funny you mention this today of all days because we're doing a twitch
> streaming session in a few hours on this very topic with Adron from our
> dev-rel team.
> The short answer is yes it works. Here's the example project we're working
> on that uses GraalVM via (it's
> the app we'll be using for the Drone Race at Accelerate).
> Here's the bit of code where I wrap the datastax java driver
> The remaining code is just statements and business logic. Very little
> boiler plate. So far I really like it from a dev experience perspective,
> especially the hot reloading you get for backend code with quarkus.
> Feel free to join us at 3pm EST to see us code and ask questions
> All the best,
> Sebastián Estévez | Vanguard Solution Architect
> Mobile +1.954.905.8615
> On Thu, May 9, 2019 at 12:51 AM Chris Hane  wrote:
>> Has anyone worked with graalvm to include a cql driver in the
>> native-image build?
>> Looking to see if it is possible or known to not be possible?
>> Thanks,
>> Chris


2019-05-08 Thread Chris Hane
Has anyone worked with graalvm to include a cql driver in the native-image
build?

Looking to see if it is possible or known to not be possible?


Re: Cassandra config in table

2019-02-25 Thread Chris Lohfink
In 4.0+ you can SELECT * FROM system_views.settings;


On Mon, Feb 25, 2019 at 9:22 AM Abdul Patel  wrote:

> Do we have any system table which stores all config details which we have
> in yaml or cassandra

Re: Cassandra collection tombstones

2019-01-25 Thread Chris Lohfink
>  The "estimated droppable tombstone" value is actually always wrong. Because 
> it's an estimate that does not consider overlaps (and I'm not sure about the 
> fact it considers the gc_grace_seconds either).

It considers the time the tombstone was created and the gc_grace_seconds. It
doesn't matter if the tombstone is overlapped; it still needs to be kept for
gc_grace_seconds before purging, or it can result in data resurrection.
sstablemetadata cannot reliably or safely know the table parameters that are
not kept in the sstable, so to get an accurate value you have to provide a -g or
--gc-grace-seconds parameter. I am not sure where the "always wrong" comes in,
as the quantity of data that's being shadowed is not what it's tracking (although
it would be more meaningful for single-sstable compactions if it did), just
when tombstones can be purged.
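
As a rough sketch of the time check only (the function name and timestamps
are hypothetical, and this is not the actual Cassandra code), a tombstone
becomes a purge candidate once its deletion time plus gc_grace_seconds is in
the past:

```python
def is_purgeable(local_deletion_time, gc_grace_seconds, now):
    """True once a tombstone has outlived gc_grace_seconds.

    local_deletion_time and now are epoch seconds; gc_grace_seconds is a
    duration in seconds. Only the time check is modelled here; whether the
    shadowed data takes part in the same compaction is ignored.
    """
    return local_deletion_time + gc_grace_seconds < now


deleted_at = 1_000_000   # hypothetical tombstone creation time
ten_days = 864_000       # 10 days expressed in seconds

print(is_purgeable(deleted_at, ten_days, deleted_at + 863_999))
print(is_purgeable(deleted_at, ten_days, deleted_at + 864_001))
# Passing a different grace period changes the answer for the same tombstone,
# which is why the tool needs to be told the table's value.
print(is_purgeable(deleted_at, 0, deleted_at + 1))
```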


> On Jan 25, 2019, at 8:11 AM, Alain RODRIGUEZ  wrote:
> Hello, 
> I think you might be inserting on the top of an existing collection, 
> implicitly, Cassandra creates a range tombstone. Cassandra does not 
> update/delete data, it always inserts (data or tombstone). Then eventually 
> compaction merges the data and evict the tombstones. Thus, when overwriting 
> an entire collection, Cassandra performs a delete first under the hood.
> I wrote about this, in this post about 2 years ago, in the middle of this 
> (long) article: 
> <>
> Here is the part that might be of interest in your case:
> "Note: When using collections, range tombstones will be generated by INSERT 
> and UPDATE operations every time you are using an entire collection, and not 
> updating parts of it. Inserting a collection over an existing collection, 
> rather than appending it or updating only an item in it, leads to range 
> tombstones insert followed by the insert of the new values for the 
> collection. This DELETE operation is hidden leading to some weird and 
> frustrating tombstones issues."
> and
> "From the mailing list I found out that James Ravn posted about this topic 
> using list example, but it is true for all the collections, so I won’t go 
> through more details, I just wanted to point this out as it can be 
> surprising, see: 
> <>"
> Thus to specifically answer your questions:
>  Does this tombstone ever get removed?
> Yes, after gc_grace_seconds (table option) happened AND if the data that is 
> shadowed by the tombstone is also part of the same compaction (all the 
> previous shards need to be there if I remember correctly). So yes, but 
> eventually, not immediately nor any time soon (10+ days by default). 
> Also when I run sstablemetadata on the only sstable, it shows "Estimated 
> droppable tombstones" as 0.5", Similarly it shows one record with epoch time 
> as insert time for - "Estimated tombstone drop times: 1548384720: 1". Does it 
> mean that when I do sstablemetadata on a table having collections, the 
> estimated droppable tombstone ratio and drop times values are not true and 
> dependable values due to collection/list range tombstones?
> I do not remember this precisely but you can check the code, it's worth 
> having a look. The "estimated droppable tombstone" value is actually always 
> wrong. Because it's an estimate that does not consider overlaps (and I'm not 
> sure about the fact it considers the gc_grace_seconds either). But also 
> because calculation does not count a certain type of tombstones and the 
> weight of range tombstones compared to the tombstone cells makes the count 
> quite inaccurate: 
> <>.
> I think this evolved since I looked at it and might not remember well, but 
> this value is definitely not accurate. 
> If you're re-inserting a collection for a given existing partition often, 
> there is probably plenty of tombstones sitting around though, that's almost 
> guaranteed.
> Does tombstone_threshold of compaction depend on the sstablemetadata 
> threshold value? If so then for tables having collections, this is not a true 
> threshold right?
> Yes, I believe the tombstone threshold actually uses the "estimated droppable 
> tombstone" value to chose 

Re: Compact storage removal effect

2019-01-22 Thread Chris Lohfink
In 3.x+ the on-disk format is the same with compact storage on or off, so you
shouldn't expect much of a difference in table size with the new storage format,
compared to the difference between compact and non-compact in 2.x.


> On Jan 22, 2019, at 10:21 AM, Nitan Kainth  wrote:
> hey Chris,
> We upgraded from 3.0.4 to 3.11. Yes, I did run upgradesstables -a to migrate 
> sstables. 
> Here is the table structure:
> CREATE TABLE ks.cf1 (
>     key text,
>     column1 timestamp,
>     value blob,
>     PRIMARY KEY (key, column1)
> );
> CREATE TABLE ks.cf2 (
>     key bigint,
>     column1 text,
>     value blob,
>     PRIMARY KEY (key, column1)
> );
> CREATE TABLE ks.cf3 (
>     key text,
>     column1 timestamp,
>     value int,
>     PRIMARY KEY (key, column1)
> );
> On Tue, Jan 22, 2019 at 10:07 AM Chris Lohfink  
> wrote:
> What version are you running? Did you include an upgradesstables -a or 
> something to rebuild without the compact storage in your migration?
> After 3.0 the new format can be more or less the same size as the 2.x compact 
> storage tables depending on schema (which can impact things a lot).
> Chris
> > On Jan 22, 2019, at 9:58 AM, Nitan Kainth wrote:
> > 
> > Hi,
> > 
> > We are testing to migrate off from compact storage. After removing compact 
> > storage, we were hoping to see an increase in disk usage but nothing 
> > changed. 
> > any feedback, why didn't we see an increase in storage?

Re: Compact storage removal effect

2019-01-22 Thread Chris Lohfink
What version are you running? Did you include an upgradesstables -a or 
something to rebuild without the compact storage in your migration?

After 3.0 the new format can be more or less the same size as the 2.x compact 
storage tables depending on schema (which can impact things a lot).


> On Jan 22, 2019, at 9:58 AM, Nitan Kainth  wrote:
> Hi,
> We are testing to migrate off from compact storage. After removing compact 
> storage, we were hoping to see an increase in disk usage but nothing changed. 
> any feedback, why didn't we see an increase in storage?


How can I limit the non-heap memory for Cassandra

2019-01-02 Thread Chris Mildebrandt

Is there a way to limit Cassandra’s off-heap memory usage? I can’t find a
way to limit the memory used for row caches, bloom filters, etc. We’re
running Cassandra in a container and would like to place limits on it to
avoid it becoming a “noisy neighbor”. But we also don’t want it killed by
the oom killer, so just placing limits on the container won't help.


Re: High CPU usage on some of the nodes due to message coalesce

2018-10-20 Thread Chris Lohfink
1s young GCs are horrible and likely the cause of some of your bad metrics. How
large are your mutations/query results, and what GC/heap settings are you using?

You can use 
<> to see the threads generating 
allocation pressure and using the CPU (ttop) and what garbage is being created
(hh --dead-young).

Just a shot in the dark: I would guess you have rather large mutations putting
pressure on the commitlog and heap. G1 with a larger heap might help in that
scenario to reduce fragmentation and adjust its eden and survivor regions to
the allocation rate better (but give it a bigger reserve space), but there are
limits to what can help if you can't change your workload. Without more info on
the schema etc. it's hard to tell, but maybe that can give you some ideas on
places to look. It could just as likely be repair coordination, wide partition
reads, or compactions, so you need to look more at what within the app is causing
the pressure to know if it's possible to improve with settings or if the load
your application is producing exceeds what your cluster can handle (needs more 


> On Oct 20, 2018, at 5:18 AM, onmstester onmstester 
>  wrote:
> 3 nodes in my cluster have 100% cpu usage and most of it is used by 
> org.apache.cassandra.util.coalesceInternal and
> The most active threads are the messaging-service-incomming.
> Other nodes are normal, having 30 nodes, using Rack Aware strategy. with 10 
> rack each having 3 nodes. The problematic nodes are configured for one rack, 
> on normal write load, system.log reports too many hint message dropped (cross 
> node). also there are alot of parNewGc with about 700-1000ms and commit log 
> isolated disk, is utilized about 80-90%. on startup of these 3 nodes, there 
> are alot of "updateing topology" logs (1000s of them pending). 
> Using iperf, i'm sure that network is OK
> checking NTPs and mutations on each node, load is balanced among the nodes.
> using apache cassandra 3.11.2
> I can not not figure out the root cause of the problem, although there are 
> some obvious symptoms.
> Best Regards

Re: jmxterm "#NullPointerException: No such PID "

2018-09-20 Thread Chris Lohfink
For what it's worth, I highly recommend you remove that option in all
Cassandra clusters first thing. A possibly nonexistent improvement (i.e.
/tmp on a different, low-throughput drive) vs. being able to diagnose issues is
a no-brainer. You can measure or monitor GC logs for your safepoint pauses
to see if it's ever a significant portion of your GC pauses.


On Thu, Sep 20, 2018 at 6:05 AM Philip Ó Condúin wrote:

> Thank you Yuki, this explains it.
> I am used to working on C* 2.1 in production where this JVM flag is not
> enabled.
> On Wed, 19 Sep 2018 at 00:29, Yuki Morishita  wrote:
>> This is because Cassandra sets -XX:+PerfDisableSharedMem JVM option by
>> default.
>> This prevents tools such as jps to list jvm processes.
>> See for detail.
>> You can work around by doing what Riccardo said.
>> On Tue, Sep 18, 2018 at 9:41 PM Philip Ó Condúin
>>  wrote:
>> >
>> > Hi Riccardo,
>> >
>> > Yes that works for me:
>> >
>> > Welcome to JMX terminal. Type "help" for available commands.
>> > $> open localhost:7199
>> > #Connection to localhost:7199 is opened
>> > $>domains
>> > #following domains are available
>> > JMImplementation
>> > ch.qos.logback.classic
>> >
>> > java.lang
>> > java.nio
>> > java.util.logging
>> > org.apache.cassandra.db
>> > org.apache.cassandra.hints
>> > org.apache.cassandra.internal
>> > org.apache.cassandra.metrics
>> >
>> > org.apache.cassandra.request
>> > org.apache.cassandra.service
>> > $>
>> >
>> > I can work with this :-)
>> >
>> > Not sure why the JVM is not listed when issuing the JVMS command, maybe
>> its a server setting, our production servers find the Cass JVM.  I've spent
>> half the day trying to figure it out so I think I'll just put it to bed now
>> and work on something else.
>> >
>> > Regards,
>> > Phil
>> >
>> > On Tue, 18 Sep 2018 at 13:34, Riccardo Ferrari 
>> wrote:
>> >>
>> >> Hi Philip,
>> >>
>> >> I've used jmxterm myself without any problems particular problems. On
>> my systems too, I don't get the cassandra daemon listed when issuing the
>> `jvms` command but I never spent much time investigating it.
>> >> Assuming you have not changed anything relevant in the
>> you can connect using jmxterm by issuing 'open
>>'. Would that work for you?
>> >>
>> >> HTH,
>> >>
>> >>
>> >>
>> >> On Tue, Sep 18, 2018 at 2:00 PM, Philip Ó Condúin <
>>> wrote:
>> >>>
>> >>> Further info:
>> >>>
>> >>> I would expect to see the following when I list the jvm's:
>> >>>
>> >>> Welcome to JMX terminal. Type "help" for available commands.
>> >>> $>jvms
>> >>> 25815(m) - org.apache.cassandra.service.CassandraDaemon
>> >>> 17628( ) - jmxterm-1.0-alpha-4-uber.jar
>> >>>
>> >>> But jmxtem is not picking up the JVM for Cassandra for some reason.
>> >>>
>> >>> Can someone point me in the right direction?  Is there settings in
>> the file I need to amend to get jmxterm to find the cass
>> jvm?
>> >>>
>> >>> Im not finding much about it on google.
>> >>>
>> >>> Thanks,
>> >>> Phil
>> >>>
>> >>>
>> >>> On Tue, 18 Sep 2018 at 12:09, Philip Ó Condúin <
>>> wrote:
>> >>>>
>> >>>> Hi All,
>> >>>>
>> >>>> I need a little advice.  I'm trying to access the JMX terminal using
>> jmxterm-1.0-alpha-4-uber.jar with a very simple default install of C* 3.11.3
>> >>>>
>> >>>> I keep getting the following:
>> >>>>
>> >>>> [cassandra@reaper-1 conf]$ java -jar jmxterm-1.0-alpha-4-uber.jar
>> >>>> Welcome to JMX terminal. Type "help" for available commands.
>> >>>> $>open 1666
>> >>>> #NullPointerException: No such PID 1666
>> >>>> $>
>> >>>>
>> >>>> C* is running with a PID of 1666.  I've tried setting JMX_LOCAL=no
>> and have even created a new VM to test it.
>> >>>>
>> >>>> Does anyone know what I might be doing wrong here?
>> >>>>
>> >>>> Kind Regards,
>> >>>> Phil
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Regards,
>> >>> Phil
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards,
>> > Phil
> --
> Regards,
> Phil

Re: Setting up rerouting java/python driver read requests from unresponsive nodes to good ones

2018-08-15 Thread Chris Lohfink
That’s what the retry handler does (see Horia’s response). You can also use the 
speculative retry to possibly send requests to multiple coordinators a little 
earlier as well to reduce the impact of the slow requests (ie a GC).



> On Aug 15, 2018, at 6:57 AM, Horia Mocioi  wrote:
> Hello,
> I believe that this is what you are looking for - 
> In particular, tryNextHost().
> Regards,
> Horia
>> On ons, 2018-08-15 at 14:16 +0300, Vsevolod Filaretov wrote:
>> Hello Cassandra community!
>> Unfortunately, I cannot find the corresponding info via load balancing 
>> manuals, so the question is:
>> Is it possible to set up java/python cassandra driver to redirect 
>> unsuccessful read requests from the coordinator node, which came to be 
>> unresponsive during the session, to the up and running one (dynamically 
>> switch to other coordinator node from the dead one)?
>> If the answer is no, what could be my alternatives?
>> Thank you all in advance,
>> Vsevolod Filaretov.

Re: Cassandra Compaction Metrics - CompletedTasks vs TotalCompactionCompleted

2018-08-10 Thread Chris Lohfink
If it's occurring that often, you can monitor nodetool compactionstats to see
what's running.

> On Aug 10, 2018, at 11:35 AM, Dionne Cloudoupoulos  
> wrote:
> On 2017/10/31 16:56:29, Chris Lohfink wrote:
>> The "CompletedTasks" metric is a measure of how many tasks ran on these two
>> executors combined.
>> The "TotalCompactionsCompleted" metric is a measure of how many compactions
>> issued from the compaction manager ran (normal compactions, cache writes,
>> scrub, 2i and MVs).  So while they may be close, depending on whats
>> happening on the system, theres no assurance that they will be within any
>> bounds of each other.
>all this is very interesting, but I do not understand why
> CompletedTasks grows at the rate of five thousand operations per hour in
> my cloud. Have an idea where can I look? kalo dromo

Re: concurrent_compactors via JMX

2018-07-18 Thread Chris Lohfink
Refer to Alain's email, but to strictly answer the question of increasing
concurrent_compactors via JMX:

There are two attributes you can increase that set the maximum number of
concurrent compactions.

org.apache.cassandra.db:type=CompactionManager,name=MaximumCompactorThreads -> 6
org.apache.cassandra.db:type=CompactionManager,name=CoreCompactorThreads -> 6

Would set it to 6. To decrease them you will want to go in the opposite order
(core then max). Just increasing the number of concurrent compactors doesn't
mean that all of them will be utilized, though.
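
The ordering rule can be written down as a small helper. This is just a
sketch: resize_steps is a made-up name, it only computes the order of the two
attribute writes and does not talk to JMX, and the rationale in the docstring
(core never exceeding max) is my assumption about why the order matters.

```python
def resize_steps(current, target):
    """Return (attribute, value) pairs in the order they should be set.

    Growing the pool: raise the maximum first, then the core size.
    Shrinking the pool: lower the core size first, then the maximum.
    Assumed rationale: core never exceeds max at any intermediate step.
    """
    if target > current:
        order = ["MaximumCompactorThreads", "CoreCompactorThreads"]
    else:
        order = ["CoreCompactorThreads", "MaximumCompactorThreads"]
    return [(name, target) for name in order]


# Going from 2 to 6 compactors, then back down to 2.
for name, value in resize_steps(2, 6) + resize_steps(6, 2):
    print(f"set {name} = {value}")
```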


> On Jul 17, 2018, at 12:18 PM, Alain RODRIGUEZ  wrote:
> Hello Riccardo,
> I noticed I have been writing a novel to answer a simple couple of questions 
> again ¯\_(ツ)_/¯. So here is a short answer in the case that's what you were 
> looking for :). Also, there is a warning that it might be counter-productive 
> and stress the cluster even more to increase the compaction throughput. There 
> is more information below ('about the issue').
> tl;dr: 
> What about using 'nodetool setcompactionthroughput XX' instead? It should 
> be available there.
> In the same way 'nodetool getcompactionthroughput' gives you the current 
> value. Be aware that this change done through JMX/nodetool is not permanent. 
> You still need to update the cassandra.yaml file.
> If you really want to use the MBean through JMX, because using 'nodetool' is 
> too easy (or for any other reason :p):
> Mbean: org.apache.cassandra.service.StorageServiceMBean
> Attribute: CompactionThroughputMbPerSec
> Long story with the "how to" since I went through this search myself, I did 
> not know where this MBean was.
> Can someone point me to the right mbean? 
> I can not really find good docs about mbeans (or tools ...) 
> I am not sure about the doc, but you can use jmxterm 
> ( 
> <>).
> To replace the doc I use CCM ( 
> <>) + jconsole to find the mbeans locally:
> * Add loopback addresses for ccm (see the readme file)
> * then, create the cluster: * 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
> * Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | 
> cut -d "=" -f 2)'
> * Explore MBeans, try to guess where this could be (and discover other funny 
> stuff in there :)).
> I must admit I did not find it this way using C*3.0.6 and jconsole. 
> I looked at the code, I locally used C*3.0.6 and ran 'grep -RiI 
> CompactionThroughput' with this result: 
> <>
> With this I could find the right MBean, the only code documentation that is 
> always up to date is the code itself I am afraid:
> './src/java/org/apache/cassandra/service/ 
> void setCompactionThroughputMbPerSec(int value);' 
> Note that the research in the code also leads to nodetool ;-).
> I could finally find the MBean in the 'jconsole' too: 
> <> (not sure how long this link will 
> live).
> jconsole also allows you to see what attributes it is possible to set or not.
> You can now find any other MBean you would need I hope :).
> see if it helps when the system is under stress
> About the issue
> You don't exactly say what you are observing, what is that "stress"? How is 
> it impacting the cluster?
> I ask because I am afraid this change might not help and even be 
> counter-productive. Even though having SSTables nicely compacted make a huge 
> difference at the read time, if that's already the case for you and the data 
> is already nicely compacted, doing this change won't help. It might even make 
> things slightly worse if the current bottleneck is the disk IO during a 
> stress period as the compactors would increase their disk read throughput, 
> thus maybe fight with the read requests for disk throughput.
> If you have a similar number of sstables on all nodes, not many compactions 
> pending (nodetool compactionstats -H) and read operations are hitting a small 
> number of sstables (nodetool tablehistograms) then you probably don't need to 
> increase the compaction speed.
> Let's say that the compaction throughput is not often the cause of stress 
> during peak hours nor a direct way to make things 'faster'. Generally when 
> compaction goes wrong, the number of sstables goes through the roof. 

Re: Compaction process stuck

2018-07-05 Thread Chris Lohfink
That looks a bit to me like it isn't stuck but is just a long-running compaction.
Can you include the output of `nodetool compactionstats` and `nodetool
cfstats`, with the schema for the table that's being compacted (redacted names if 

You can stop the compaction with `nodetool stop COMPACTION` or by restarting
the node.


> On Jul 5, 2018, at 12:08 AM, atul atri  wrote:
> Hi,
> We noticed that compaction process is also hanging on a node in backup ring. 
> Please find attached thread dump for both servers. Recently, we have made few 
> changes in cluster topology.
> a. Added new server in backup data-center and decommissioned old server. 
> Backup ring only has 2 server.
> b. Added new node in primary data-center. Now it has 4 nodes.
> Is there way we can stop this compaction? As we have added a new node in this 
> cluster and we are waiting to run cleanup on this node on which compaction is 
> hanging. I am afraid that cleanup will not start until compaction job 
> finishes. 
> Attachments:
> 1. cass-logg02.prod2.thread_dump.out: Thread dump from old node in primary 
> datacenter
> 2. cass-logg03.prod1.thread_dump.out: Thread dump from new node in backup 
> datacenter. This node is added recently.
> Your help is much appreciated. 
> Thanks & Regards,
> Atul Atri.
> On 4 July 2018 at 21:15, atul atri wrote:
> Hi Chris,
> Thanks for reply.
> Unfortunately, our servers do not have jstack installed. 
> I tried "kill -3 " option but that is also not generating thread dump. 
> Is there any other way I can generate thread dump?
> Thanks & Regards,
> Atul Atri.
> On 4 July 2018 at 20:32, Chris Lohfink wrote:
> Can you take a thread dump (jstack) and share the state of the compaction 
> threads? Also check for “Exception” in logs
> Chris
> On Jul 4, 2018, at 8:37 AM, atul atri wrote:
>> Hi,
>> On one of our server, compaction process is hanging. It's stuck at 80%. It 
>> was stuck for last 3 days. And today we did a cluster restart (one host at 
>> time). And again it is stuck at same 80%. CPU usages are 100% and there 
>> seems no IO issue. We are seeing following kinds of WARNING in system.log
>> (line 226) Batch of prepared statements for [, 
>> *] is of size 7557, exceeding specified threshold of 5120 by 2437.
>> Other than this there seems no error.  I have tried to stop compaction 
>> process, but it does not stop. Cassandra version is 2.1.
>>  Can someone please guide us in solving this issue?
>> Thanks & Regards,
>> Atul Atri.

Re: Compaction process stuck

2018-07-04 Thread Chris Lohfink
Can you take a thread dump (jstack) and share the state of the compaction 
threads? Also check for “Exception” in logs



> On Jul 4, 2018, at 8:37 AM, atul atri  wrote:
> Hi,
> On one of our server, compaction process is hanging. It's stuck at 80%. It 
> was stuck for last 3 days. And today we did a cluster restart (one host at 
> time). And again it is stuck at same 80%. CPU usages are 100% and there seems 
> no IO issue. We are seeing following kinds of WARNING in system.log
> (line 226) Batch of prepared statements for [, *] 
> is of size 7557, exceeding specified threshold of 5120 by 2437.
> Other than this there seems no error.  I have tried to stop compaction 
> process, but it does not stop. Cassandra version is 2.1.
>  Can someone please guide us in solving this issue?
> Thanks & Regards,
> Atul Atri.

Re: G1GC CPU Spike

2018-06-15 Thread Chris Lohfink
There are no bad GCs in the gclog (the worst is about 100ms). Everything looks
great, actually, from what I see. CPU utilization isn't inherently a bad thing,
for what it's worth.


> On Jun 14, 2018, at 1:18 PM, rajpal reddy  wrote:
> Hey Chris,
> Sorry to bother you. Did you get a chance to look at the gclog file I sent 
> last night.
> On Wed, Jun 13, 2018, 8:44 PM rajpal reddy wrote:
> Chris,
> sorry attached wrong log file. attaching gc collection seconds and cpu. there 
> were going high at the same time and also attached the gc.log. grafana 
> dashboard and gc.log timing are 4hours apart gc can be see 06/12th around 
> 22:50
> rate(jvm_gc_collection_seconds_sum{"}[5m])
> > On Jun 13, 2018, at 5:26 PM, Chris Lohfink wrote:
> > 
> > There are not even a 100ms GC pause in that, are you certain theres a 
> > problem?
> > 
> >> On Jun 13, 2018, at 3:00 PM, rajpal reddy wrote:
> >> 
> >> Thanks Chris I did attached the gc logs already. reattaching them 
> >> now.
> >> 
> >> it started yesterday around 11:54PM 
> >>> On Jun 13, 2018, at 3:56 PM, Chris Lohfink wrote:
> >>> 
> >>>> What is the criteria for picking up the value for G1ReservePercent?
> >>> 
> >>> 
> >>> it depends on the object allocation rate vs the size of the heap. 
> >>> Cassandra ideally would be sub 500-600mb/s allocations but it can spike 
> >>> pretty high with something like reading a wide partition or repair 
> >>> streaming which might exceed what the g1 ygcs tenuring and timing is 
> >>> prepared for from previous steady rate. Giving it a bigger buffer is a 
> >>> nice safety net for allocation spikes.
> >>> 
> >>>> is the HEAP_NEWSIZE is required only for CMS
> >>> 
> >>> 
> >>> it should only set Xmn with that if using CMS, with G1 it should be 
> >>> ignored or else yes it would be bad to set Xmn. Giving the gc logs will 
> >>> give the results of all the bash scripts along with details of whats 
> >>> happening so its your best option if you want help to share that.
> >>> 
> >>> Chris
> >>> 
> >>>> On Jun 13, 2018, at 12:17 PM, Subroto Barua 
> >>>>  wrote:
> >>>> 
> >>>> Chris,
> >>>> What is the criteria for picking up the value for G1ReservePercent?
> >>>> 
> >>>> Subroto 
> >>>> 
> >>>>> On Jun 13, 2018, at 6:52 AM, Chris Lohfink wrote:
> >>>>> 
> >>>>> G1ReservePercent
> >>>> 

Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
There is not even a 100ms GC pause in that; are you certain there's a problem?

> On Jun 13, 2018, at 3:00 PM, rajpal reddy  wrote:
> Thanks Chris I did attached the gc logs already. reattaching them 
> now.
> it started yesterday around 11:54PM 
>> On Jun 13, 2018, at 3:56 PM, Chris Lohfink  wrote:
>>> What is the criteria for picking up the value for G1ReservePercent?
>> it depends on the object allocation rate vs the size of the heap. Cassandra 
>> ideally would be sub 500-600mb/s allocations but it can spike pretty high 
>> with something like reading a wide partition or repair streaming which might 
>> exceed what the g1 ygcs tenuring and timing is prepared for from previous 
>> steady rate. Giving it a bigger buffer is a nice safety net for allocation 
>> spikes.
>>> is the HEAP_NEWSIZE is required only for CMS
>> it should only set Xmn with that if using CMS, with G1 it should be ignored 
>> or else yes it would be bad to set Xmn. Giving the gc logs will give the 
>> results of all the bash scripts along with details of whats happening so its 
>> your best option if you want help to share that.
>> Chris
>>> On Jun 13, 2018, at 12:17 PM, Subroto Barua  
>>> wrote:
>>> Chris,
>>> What is the criteria for picking up the value for G1ReservePercent?
>>> Subroto 
>>>> On Jun 13, 2018, at 6:52 AM, Chris Lohfink  wrote:
>>>> G1ReservePercent


Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
> What is the criteria for picking up the value for G1ReservePercent?

It depends on the object allocation rate vs. the size of the heap. Cassandra
ideally would be sub 500-600 MB/s allocations, but it can spike pretty high with
something like reading a wide partition or repair streaming, which might exceed
what the G1 young GCs' tenuring and timing are prepared for from the previous
steady rate. Giving it a bigger buffer is a nice safety net for allocation spikes.

> is the HEAP_NEWSIZE is required only for CMS

It should only set Xmn with that if using CMS; with G1 it should be ignored, or
else yes, it would be bad to set Xmn. Sharing the GC logs will give the results
of all the bash scripts along with details of what's happening, so that is your
best option if you want help.


> On Jun 13, 2018, at 12:17 PM, Subroto Barua  
> wrote:
> Chris,
> What is the criteria for picking up the value for G1ReservePercent?
> Subroto 
>> On Jun 13, 2018, at 6:52 AM, Chris Lohfink  wrote:
>> G1ReservePercent


Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
That metric is the total number of seconds spent in GC; it will increase over
time with every young GC, which is expected. What's interesting is the rate of
growth, not the fact that it's increasing. If your graphing tool has an option
to graph the derivative, you should use that instead.
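
If the graphing tool has no derivative option, the same thing can be computed
by hand from the polled values. A minimal sketch, with a made-up function name
and made-up samples, assuming each sample is a (timestamp in seconds,
cumulative GC seconds) pair:

```python
def per_second_rates(samples):
    """Turn cumulative (timestamp_s, gc_seconds_total) samples into rates.

    Returns (timestamp_s, rate) pairs, where rate is the GC time spent per
    wall-clock second over the preceding interval. Intervals where the
    counter went backwards (e.g. a JVM restart) are skipped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0 or v1 < v0:
            continue
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates


# Hypothetical one-minute polls: steady growth, then a jump in the last minute.
samples = [(0, 10.0), (60, 13.0), (120, 16.0), (180, 46.0)]
for timestamp, rate in per_second_rates(samples):
    print(timestamp, rate)
```

The raw counter only ever goes up, so the jump in the last interval is easy to
miss on a graph of the total, but it stands out once the values are turned into
a rate.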


> On Jun 13, 2018, at 9:51 AM, rajpal reddy  wrote:
> jvm_gc_collection_seconds_count{gc="G1 Young Generation"} and also young 
> generation seconds count keep increasing
>> On Jun 13, 2018, at 9:52 AM, Chris Lohfink wrote:
>> The gc log file is best to share when asking for help with tuning. The top 
>> of file has all the computed args it ran with and it gives details on what 
>> part of the GC is taking time. I would guess the CPU spike is from full GCs 
>> which with that small heap of a heap is probably from evacuation failures. 
>> Reserving more of the heap to be free (-XX:G1ReservePercent=25) can help, 
>> along with increasing the amount of heap. 8GB is pretty small for G1, might 
>> be better off with CMS.
>> Chris
>>> On Jun 13, 2018, at 8:42 AM, rajpal reddy wrote:
>>> Hello,
>>> we are using G1GC and noticing garbage collection taking a while, and during 
>>> that process we are seeing cpu spiking up to 70-80%. Can you please let us 
>>> know if we have to tune any parameters for that? Attaching the 
>>> cassandra-env file with jvm-options.

Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
The gc log file is best to share when asking for help with tuning. The top of 
the file has all the computed args it ran with and it gives details on what part 
of the GC is taking time. I would guess the CPU spike is from full GCs, which 
with that small of a heap are probably from evacuation failures. Reserving more 
of the heap to be free (-XX:G1ReservePercent=25) can help, along with 
increasing the amount of heap. 8GB is pretty small for G1, might be better off 
with CMS.
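To put the reserve suggestion in numbers, here is a minimal sketch assuming the 8GB heap mentioned above and G1's default G1ReservePercent of 10. The helper name is made up for illustration, and the real collector works in regions, so treat the figures as approximate.

```python
def g1_reserve_mib(heap_mib, reserve_percent):
    """Approximate amount of heap G1 tries to keep free as a to-space
    reserve for a given -XX:G1ReservePercent value."""
    return heap_mib * reserve_percent // 100


heap_mib = 8 * 1024  # the 8GB heap from the thread

print("default (10%):", g1_reserve_mib(heap_mib, 10), "MiB")
print("suggested (25%):", g1_reserve_mib(heap_mib, 25), "MiB")
```

Going from 10 to 25 on an 8GB heap moves the reserve from well under 1 GiB to 2 GiB, which is why the suggestion is paired with a bigger heap: the reserve comes out of the space otherwise available to the application.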


> On Jun 13, 2018, at 8:42 AM, rajpal reddy  wrote:
> Hello,
> we are using G1GC and noticing garbage collection taking a while, and during 
> that process we are seeing cpu spiking up to 70-80%. Can you please let us 
> know if we have to tune any parameters for that? Attaching the cassandra-env 
> file with jvm-options.

Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage

2018-05-29 Thread Chris Lohfink
Might be better to disable explicit gcs so the full gcs don't even occur. They 
likely come from the rmi dgc or directbytebuffers rather than any actual need to 
gc; otherwise the concurrent gc threads would be an issue as well.

Nodetool also has no excuse to use that big of a heap, so it should have its max 
size capped too (along with the parallel and concurrent gc threads).
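As a rough illustration of where the 43 GC threads in the quoted report come from, and what a capped nodetool JVM might look like, here is a minimal sketch. The thread-count formula is the commonly documented HotSpot heuristic for most platforms; the 64-CPU figure is an assumption (it is the core count that yields 43), and the function names and the 128m default are chosen for illustration only. The flags themselves are standard HotSpot options.

```python
def default_parallel_gc_threads(ncpus):
    """HotSpot heuristic: one GC thread per CPU up to 8 CPUs,
    then 5/8 of a thread for every additional CPU."""
    if ncpus <= 8:
        return ncpus
    return 8 + (ncpus - 8) * 5 // 8


def capped_nodetool_jvm_flags(max_heap="128m", gc_threads=1):
    """JVM flags that cap the heap and GC threads and ignore System.gc()."""
    return [
        f"-Xmx{max_heap}",
        f"-XX:ParallelGCThreads={gc_threads}",
        f"-XX:ConcGCThreads={gc_threads}",
        "-XX:+DisableExplicitGC",
    ]


print(default_parallel_gc_threads(64))
print(" ".join(capped_nodetool_jvm_flags()))
```

A short-lived command line tool gains nothing from dozens of GC threads, so pinning both thread counts to 1 trades a negligible amount of GC throughput for not waking up every core.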


Sent from my iPhone

> On May 29, 2018, at 4:42 PM, kurt greaves  wrote:
> Good to know. So that confirms it's just the GC threads causing problems.
>> On Tue., 29 May 2018, 22:02 Steinmaurer, Thomas, 
>>  wrote:
>> Kurt,
>> in our test it also didn't make a difference with the default number of GC 
>> Threads (43 on our large machine) and running with Xmx128M or Xmx31G 
>> (derived from $MAX_HEAP_SIZE). For both Xmx, we saw the high CPU caused by 
>> nodetool.
>> Regards,
>> Thomas
>> From: kurt greaves [] 
>> Sent: Dienstag, 29. Mai 2018 13:06
>> To: User 
>> Subject: Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage
>> Thanks Thomas. After a bit more research today I found that the whole 
>> $MAX_HEAP_SIZE issue isn't really a problem because we don't explicitly set 
>> -Xms so the minimum heapsize by default will be 256mb, which isn't hugely 
>> problematic, and it's unlikely more than that would get allocated.
>> On 29 May 2018 at 09:29, Steinmaurer, Thomas 
>>  wrote:
>> Hi Kurt,
>> thanks for pointing me to the Xmx issue.
>> JIRA + patch (for Linux only based on C* 3.11) for the parallel GC thread 
>> issue is available here: 
>> Thanks,
>> Thomas
>> From: kurt greaves [] 
>> Sent: Dienstag, 29. Mai 2018 05:54
>> To: User 
>> Subject: Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage
>> 1) nodetool is reusing the $MAX_HEAP_SIZE environment variable, thus if we 
>> are running Cassandra with e.g. Xmx31G, nodetool is started with Xmx31G as 
>> well
>> This was fixed in 3.0.11/3.10 in CASSANDRA-12739. Not sure why it didn't 
>> make it into 2.1/2.2.
>> 2) As -XX:ParallelGCThreads is not explicitly set upon startup, this 
>> basically defaults to a value dependent on the number of cores. In our case, 
>> with the machine above, the number of parallel GC threads for the JVM is set 
>> to 43!
>> 3) Test-wise, we have adapted the nodetool startup script in a way to get a 
>> Java Flight Recording file on JVM exit, thus with each nodetool invocation 
>> we can inspect a JFR file. Here we may have seen System.gc() calls (without 
>> visible knowledge where they come from), GC times for the entire JVM 
>> life-time (e.g. ~1min) showing high cpu. This happened for both Xmx128M 
>> (default as it seems) and Xmx31G
>> After explicitly setting -XX:ParallelGCThreads=1 in the nodetool startup 
>> script, CPU usage spikes by nodetool are entirely gone.
>> Is this something which has been already adapted/tackled in Cassandra 
>> versions > 2.1 or worth to be considered as some sort of RFC?
>> Can you create a JIRA for this (and a patch, if you like)? We should be 
>> explicitly setting this on nodetool invocations.
>> The contents of this e-mail are intended for the named addressee only. It 
>> contains information that may be confidential. Unless you are the named 
>> addressee or an authorized designee, you may not copy or use it, or disclose 
>> it to anyone else. If you received it in error please notify us immediately 
>> and then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) 
>> is a company registered in Linz whose registered office is at 4040 Linz, 
>> Austria, Freistädterstraße 313

"Group by" while limiting a clustering column with a range

2018-04-10 Thread Chris Mildebrandt
Hey all. I’m trying to use a range to limit a clustering column while at
the same time using `group by` and running into issues. Here's a sample table:
create table if not exists samples (name text, partition int, sample int,
city text, state text, count counter, primary key ((name, partition),
sample, city, state)) with clustering order by (sample desc);

When I filter `sample` by a range, I get an error:
select city, state, sum(count) from samples where name='bob' and
partition=1 and sample>=1 and sample<=3 group by city, state;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Group
by currently only support groups of columns following their declared order

However, it allows the query when I change from a range to an equals:
select city, state, sum(count) from samples where name='bob' and
partition=1 and sample=1 group by city, state;

 city   | state | system.sum(count)
--------+-------+-------------------
 Austin |    TX |                 2
 Denver |    CO |                 1

Does this sound like a bug, or is it expected? Thanks.
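The behaviour seen above is consistent with a rule along these lines: the GROUP BY columns have to follow the primary key in its declared order, and a primary key column may only be skipped when it is restricted by equality. The following is a simplified model of that rule written for this example, not Cassandra's actual implementation, and the function name is made up.

```python
def group_by_allowed(primary_key, group_by, eq_restricted):
    """Walk the primary key in declared order. Each GROUP BY column must be
    the next primary key column, ignoring columns pinned by an equality."""
    remaining = iter(primary_key)
    for col in group_by:
        for pk_col in remaining:
            if pk_col == col:
                break  # matched, move on to the next GROUP BY column
            if pk_col not in eq_restricted:
                return False  # skipped a column that is not pinned to one value
        else:
            return False  # ran out of primary key columns without a match
    return True


pk = ["name", "partition", "sample", "city", "state"]

# sample=1: every column before city is pinned, so the grouping is accepted.
print(group_by_allowed(pk, ["city", "state"], {"name", "partition", "sample"}))

# sample>=1 and sample<=3: sample is a range, so city cannot be reached.
print(group_by_allowed(pk, ["city", "state"], {"name", "partition"}))
```

Under this model, a range on `sample` would be accepted if the grouping also started with `sample` (for example `group by sample, city, state`), though that produces one row per sample value rather than a total across the range.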

Re: tablestats and gossip

2018-04-06 Thread Chris Lohfink
Yes, it's the count of all locally applied writes to that table. An insert to a 
table with RF=3 should increase the local write count by 1 on 3 different 
nodes.


> On Apr 6, 2018, at 5:00 AM, Grzegorz Pietrusza <> wrote:
> Hi all
> Does local write count provided by tablestats include writes from gossip?


Re: Understanding Blocked and All Time Blocked columns in tpstats

2018-03-23 Thread Chris Lohfink
Increasing the queue size would increase the number of requests waiting. It 
could make GCs worse if the requests are things like large INSERTs, but for a 
lot of super tiny queries it helps to increase the queue size (to a point). 
Might want to look into what queries are being made and how, since there are 
possibly options to help with that (e.g. prepared queries, changing what the 
queries are, limiting the number of async in-flight queries).
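Limiting the number of async in-flight queries on the client side could look roughly like the sketch below. It uses only the Python standard library; `fake_query` stands in for whatever call the real driver exposes, and the class name and limits are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InFlightLimiter:
    """Allow at most max_in_flight submitted-but-unfinished requests."""

    def __init__(self, executor, max_in_flight):
        self._executor = executor
        self._permits = threading.BoundedSemaphore(max_in_flight)

    def submit(self, fn, *args):
        self._permits.acquire()  # blocks the caller once the limit is reached
        try:
            future = self._executor.submit(fn, *args)
        except BaseException:
            self._permits.release()
            raise
        # Hand the permit back when the request completes, success or failure.
        future.add_done_callback(lambda _f: self._permits.release())
        return future


def fake_query(i):
    # Placeholder for an asynchronous driver call.
    return i * 2


with ThreadPoolExecutor(max_workers=4) as pool:
    limiter = InFlightLimiter(pool, max_in_flight=8)
    futures = [limiter.submit(fake_query, i) for i in range(100)]
    results = [f.result() for f in futures]

print(len(results))
```

The point of the design is that the client slows itself down before the server-side queue fills, instead of relying on a larger queue to absorb the burst.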


> On Mar 23, 2018, at 11:42 AM, John Sanda <> wrote:
> Thanks for the explanation. In the past when I have run into problems related 
> to CASSANDRA-11363, I have increased the queue size via the 
> cassandra.max_queued_native_transport_requests system property. If I find 
> that the queue is frequently at capacity, would that be an indicator that the 
> node is having trouble keeping up with the load? And if so, will increasing 
> the queue size just exacerbate the problem?
> On Fri, Mar 23, 2018 at 11:51 AM, Chris Lohfink wrote:
> It blocks the caller attempting to add the task until there's room in the queue, 
> applying back pressure. It does not reject it. It mimics the behavior of the 
> pre-SEP DebuggableThreadPoolExecutor's RejectedExecutionHandler that the 
> other thread pools use (the exceptions are sampling/trace, which just throw 
> away on rejections).
> Worth noting this is only really possible in the native transport pool (sep 
> pool) last I checked. Since 2.1 at least, before that there were a few 
> others. That changes version to version. For (basically) all other thread 
> pools the queue is limited by memory.
> Chris
>> On Mar 22, 2018, at 10:44 PM, John Sanda < 
>> <>> wrote:
>> I have been doing some work on a cluster that is impacted by 
>> <>. Reading through the 
>> ticket prompted me to take a closer look at 
>> org.apache.cassandra.concurrent.SEPExecutor. I am looking at the 3.0.14 
>> code. I am a little confused about the Blocked and All Time Blocked columns 
>> reported in nodetool tpstats and reported by StatusLogger. I understand that 
>> there is a queue for tasks. In the case of RequestThreadPoolExecutor, the 
>> size of that queue can be controlled via the 
>> cassandra.max_queued_native_transport_requests system property.
>> I have been looking at SEPExecutor.addTask(FutureTask task), and here is 
>> my question. If the queue is full, as defined by SEPExecutor.maxTasksQueued, 
>> are tasks rejected? I do not fully grok the code, but it looks like it is 
>> possible for tasks to be rejected here (some code and comments omitted for 
>> brevity):
>> public void addTask(FutureTask task)
>> {
>>     tasks.add(task);
>>     ...
>>     else if (taskPermits >= maxTasksQueued)
>>     {
>>         WaitQueue.Signal s = hasRoom.register();
>>         if (taskPermits(permits.get()) > maxTasksQueued)
>>         {
>>             if (takeWorkPermit(true))
>>                 pool.schedule(new Work(this));
>>             s.awaitUninterruptibly();
>>             metrics.currentBlocked.dec();
>>         }
>>         else
>>             s.cancel();
>>     }
>> }
>> The first thing that happens is that the task is added to the tasks queue. 
>> pool.schedule() only gets called if takeWorkPermit() returns true. I am 
>> still studying the code, but can someone explain what exactly happens when 
>> the queue is full?
>> - John
> -- 
> - John

Re: Understanding Blocked and All Time Blocked columns in tpstats

2018-03-23 Thread Chris Lohfink
It blocks the caller attempting to add the task until there's room in the queue, 
applying back pressure. It does not reject it. It mimics the behavior of the 
pre-SEP DebuggableThreadPoolExecutor's RejectedExecutionHandler that the other 
thread pools use (the exceptions are sampling/trace, which just throw away on 
rejections).

Worth noting this is only really possible in the native transport pool (sep 
pool) last I checked, since 2.1 at least; before that there were a few others. 
That changes version to version. For (basically) all other thread pools the 
queue is limited by memory.
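The difference between blocking the caller and rejecting the task can be sketched with a plain bounded queue. This is a simplified model using Python's standard library, not the SEPExecutor code (which adds the task first and then parks the caller); the queue size and timings are arbitrary.

```python
import queue
import threading
import time

tasks = queue.Queue(maxsize=2)
tasks.put("a")
tasks.put("b")  # the queue is now full

# Rejecting: the producer is told "no" immediately and must handle it.
try:
    tasks.put_nowait("c")
    rejected = False
except queue.Full:
    rejected = True


# Blocking (back pressure): the producer waits until a consumer makes room.
def consumer():
    time.sleep(0.1)  # simulate a busy worker
    tasks.get()


worker = threading.Thread(target=consumer)
worker.start()
start = time.monotonic()
tasks.put("c")  # parks here until the consumer takes an item
blocked_for = time.monotonic() - start
worker.join()

print("rejected:", rejected)
print("producer was blocked:", blocked_for > 0)
```

In the blocking case nothing is dropped; the time the producer spends parked is what shows up as "blocked" from the caller's point of view.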


> On Mar 22, 2018, at 10:44 PM, John Sanda <> wrote:
> I have been doing some work on a cluster that is impacted by 
> <>. Reading through the 
> ticket prompted me to take a closer look at 
> org.apache.cassandra.concurrent.SEPExecutor. I am looking at the 3.0.14 code. 
> I am a little confused about the Blocked and All Time Blocked columns 
> reported in nodetool tpstats and reported by StatusLogger. I understand that 
> there is a queue for tasks. In the case of RequestThreadPoolExecutor, the 
> size of that queue can be controlled via the 
> cassandra.max_queued_native_transport_requests system property.
> I have been looking at SEPExecutor.addTask(FutureTask task), and here is 
> my question. If the queue is full, as defined by SEPExecutor.maxTasksQueued, 
> are tasks rejected? I do not fully grok the code, but it looks like it is 
> possible for tasks to be rejected here (some code and comments omitted for 
> brevity):
> public void addTask(FutureTask task)
> {
>     tasks.add(task);
>     ...
>     else if (taskPermits >= maxTasksQueued)
>     {
>         WaitQueue.Signal s = hasRoom.register();
>         if (taskPermits(permits.get()) > maxTasksQueued)
>         {
>             if (takeWorkPermit(true))
>                 pool.schedule(new Work(this));
>             s.awaitUninterruptibly();
>             metrics.currentBlocked.dec();
>         }
>         else
>             s.cancel();
>     }
> }
> The first thing that happens is that the task is added to the tasks queue. 
> pool.schedule() only gets called if takeWorkPermit() returns true. I am still 
> studying the code, but can someone explain what exactly happens when the 
> queue is full?
> - John

Re: Delete System_Traces Table

2018-03-19 Thread Chris Lohfink
Traces and auth in that version have a whitelist of tables that can be dropped 
(legacy auth tables).

It does make sense to allow CREATEs in the distributed tables, mostly because 
of auth. That way, if the auth tables are changed in a later version, you can 
pre-prime them before an upgrade. It might be a bit of an overstep in 
protecting users from themselves, but it doesn't hurt anything to have the 
table there. Just ignore it; its existence will not cause any issues.


> On Mar 19, 2018, at 10:27 AM, shalom sagges <> wrote:
> That's weird... I'm using 3.0.12, so I should've still been able to drop it, 
> no?
> Also, if I intend to upgrade to version 3.11.2, will the existence of the 
> table cause any issues?
> Thanks!
> On Mon, Mar 19, 2018 at 4:30 PM, Chris Lohfink < 
> <>> wrote:
> Oh I misread original, I see.
> With 
> <> you wont be able to 
> drop the table, but would be worth a ticket to prevent creation in those 
> keyspaces or allow some sort of override if allowing create.
> Chris
>> On Mar 19, 2018, at 9:15 AM, shalom sagges < 
>> <>> wrote:
>> Yes, that's correct. 
>> I'd definitely like to keep the default tables. 
>> On Mon, Mar 19, 2018 at 4:10 PM, Rahul Singh < 
>> <>> wrote:
>> I think he just wants to delete the test table not the whole keyspace. Is 
>> that correct?
>> --
>> Rahul Singh
>> <>
>> Anant Corporation
>> On Mar 19, 2018, 9:08 AM -0500, Chris Lohfink < 
>> <>>, wrote:
>>> No.
>>> Why do you want to? If you don't use tracing they will be empty, and if 
>>> were able to drop them you will no longer be able to use tracing in 
>>> debugging.
>>> Chris
>>>> On Mar 19, 2018, at 7:52 AM, shalom sagges < 
>>>> <>> wrote:
>>>> Hi All,
>>>> I accidentally created a test table on the system_traces keyspace.
>>>> When I tried to drop the table with the Cassandra user, I got the 
>>>> following error:
>>>> Unauthorized: Error from server: code=2100 [Unauthorized] message="Cannot 
>>>> DROP "
>>>> Is there a way to drop this table permanently?
>>>> Thanks!

Re: Delete System_Traces Table

2018-03-19 Thread Chris Lohfink
Oh, I misread the original, I see.

With <> you won't be able to 
drop the table, but it would be worth a ticket to prevent creation in those 
keyspaces or allow some sort of override if allowing create.


> On Mar 19, 2018, at 9:15 AM, shalom sagges <> wrote:
> Yes, that's correct. 
> I'd definitely like to keep the default tables. 
> On Mon, Mar 19, 2018 at 4:10 PM, Rahul Singh < 
> <>> wrote:
> I think he just wants to delete the test table not the whole keyspace. Is 
> that correct?
> --
> Rahul Singh
> <>
> Anant Corporation
> On Mar 19, 2018, 9:08 AM -0500, Chris Lohfink < 
> <>>, wrote:
>> No.
>> Why do you want to? If you don't use tracing they will be empty, and if were 
>> able to drop them you will no longer be able to use tracing in debugging.
>> Chris
>>> On Mar 19, 2018, at 7:52 AM, shalom sagges < 
>>> <>> wrote:
>>> Hi All,
>>> I accidentally created a test table on the system_traces keyspace.
>>> When I tried to drop the table with the Cassandra user, I got the following 
>>> error:
>>> Unauthorized: Error from server: code=2100 [Unauthorized] message="Cannot 
>>> DROP "
>>> Is there a way to drop this table permanently?
>>> Thanks!

Re: Delete System_Traces Table

2018-03-19 Thread Chris Lohfink

No.

Why do you want to? If you don't use tracing they will be empty, and if you 
were able to drop them you would no longer be able to use tracing for debugging.


> On Mar 19, 2018, at 7:52 AM, shalom sagges <> wrote:
> Hi All, 
> I accidentally created a test table on the system_traces keyspace. 
> When I tried to drop the table with the Cassandra user, I got the following 
> error:
> Unauthorized: Error from server: code=2100 [Unauthorized] message="Cannot 
> DROP "
> Is there a way to drop this table permanently? 
> Thanks!

Re: WARN [PERIODIC-COMMIT-LOG-SYNCER] .. exceeded the configured commit interval by an average of...

2018-03-16 Thread Chris Lohfink
If you just want to make it work, increase commitlog_segment_size_in_mb to 64. 
A single mutation cannot exceed 1/2 the segment size.

If you want to actually fix your problem, decrease the size of the mutations 
and limit the size of the value blob. <== recommended
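The arithmetic behind the two options, as a minimal sketch: the 24.142MiB mutation and the 16.000MiB limit come from the error in the quoted log, and a 32 MB segment (the default commitlog_segment_size_in_mb) is what gives that 16MiB limit. The function names are made up for illustration.

```python
def max_mutation_mib(commitlog_segment_size_in_mb):
    """A single mutation may be at most half a commitlog segment."""
    return commitlog_segment_size_in_mb / 2


def mutation_fits(mutation_mib, commitlog_segment_size_in_mb):
    return mutation_mib <= max_mutation_mib(commitlog_segment_size_in_mb)


mutation_mib = 24.142  # size reported in the ERROR line

for segment_mb in (32, 64):
    print(segment_mb, max_mutation_mib(segment_mb),
          mutation_fits(mutation_mib, segment_mb))
```

Doubling the segment size only moves the ceiling to 32MiB, so a slightly larger blob would hit the same error again, which is why shrinking the mutations is the recommended fix.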


> On Mar 16, 2018, at 9:19 AM, Frank Limstrand <> wrote:
> Hi listmembers 
> Basic info
> [cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
> CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': '3'}  AND durable_writes = true;
> 8 linux nodes, SSD. 64GB memory on each server.
> Additional information after the signature.
> We are about to enter production with this new cluster and are using our own 
> (homemade) application to test with.
> Problem
> We see this frequently in system.log on all servers:
> Timestamp WARN  [PERIODIC-COMMIT-LOG-SYNCER] - Out of 27 
> commit log syncs over the past 266.24s with average duration of 53.21ms, 1 
> have exceeded the configured commit interval by an average of 3.89ms
> (The last ms number vary from log messages to log message but is never over 
> 1000ms, more in the 100 ms range)
> We have had one ERROR log message on one node:
> Timestamp ERROR [MutationStage-2] - Failed to apply 
> mutation locally : {}
> java.lang.IllegalArgumentException: Mutation of 24.142MiB is too large for 
> the maximum size of 16.000MiB
> On two other nodes we got this
> Timestamp WARN  [MutationStage-3] 
> - Uncaught exception on thread Thread[MutationStage-3,5,main]: {}
> java.lang.IllegalArgumentException: Mutation of 24.142MiB is too large for 
> the maximum size of 16.000MiB
> Our application got this in the log
> Cassandra failure during write query at consistency QUORUM (2 responses were 
> required but only 0 replica responded, 2 failed)
> com.datastax.driver.core.exceptions.WriteFailureException: Cassandra failure 
> during write query at consistency QUORUM (2 responses were required but only 
> 0 replica responded, 2 failed)
> Are the WARNings a sign that there can be ERRORs like this? Are they related 
> somehow?
> We decided to relax some performance parameters in our application and the 
> WARN log messages now come very seldomly but they are there. We have seen the 
> same WARN log message at nightime when we don't run our application at all so 
> WARN messages were unexpected.
> There are no GC warnings about long pauses.
> Any thoughts about how to proceed with this issue?
> Kind regards
> Frank Limstrand
> National Library of Norway
> All tables created like this:
> CREATE TABLE mykeyspace.mytable (
> key blob,
> column1 timeuuid,
> column2 text,
> value blob,
> PRIMARY KEY (key, column1, column2)
> ) WITH CLUSTERING ORDER BY (column1 ASC, column2 ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND comment = 'Column Family for storing job execution record information'
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
> 'max_threshold': '32', 'min_threshold': '4'}
> AND compression = {'chunk_length_in_kb': '64', 'class': 
> ''}
> AND crc_check_chance = 1.0
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
> cassandra.yaml:
> hinted_handoff_enabled: true
> max_hint_window_in_ms: 10800000 # 3 hours
> hinted_handoff_throttle_in_kb: 1024
> max_hints_delivery_threads: 2
> hints_directory: /d1/cassandra/data/hints
> hints_flush_period_in_ms: 1
> max_hints_file_size_in_mb: 128
> batchlog_replay_throttle_in_kb: 1024
> authenticator: AllowAllAuthenticator
> authorizer: AllowAllAuthorizer
> role_manager: CassandraRoleManager
> roles_validity_in_ms: 2000
> permissions_validity_in_ms: 2000
> credentials_validity_in_ms: 2000
> partitioner: org.apache.cassandra.dht.RandomPartitioner
> data_file_directories:
> - /d1/cassandra/data
> - /d2/cassandra/data
> commitlog_directory: /d1/cassandra/commitlog
> cdc_enabled: false
> disk_failure_policy: stop
> commit_failure_policy: stop
> prepared_statements_cache_size_mb:
> thrift_prepared_statements_cache_size_mb:

Re: system.size_estimates - safe to remove sstables?

2018-03-06 Thread Chris Lohfink
While it's off you can delete the files in the directory, yeah.


> On Mar 6, 2018, at 2:35 AM, Kunal Gangakhedkar <> 
> wrote:
> Hi Chris,
> I checked for snapshots and backups - none found.
> Also, we're not using opscenter, hadoop or spark or any such tool.
> So, do you think we can just remove the cf and restart the service?
> Thanks,
> Kunal
> On 5 March 2018 at 21:52, Chris Lohfink < 
> <>> wrote:
> Any chance space used by snapshots? What files exist there that are taking up 
> space?
> > On Mar 5, 2018, at 1:02 AM, Kunal Gangakhedkar < 
> > <>> wrote:
> >
> > Hi all,
> >
> > I have a 2-node cluster running cassandra 2.1.18.
> > One of the nodes has run out of disk space and died - almost all of it 
> > shows up as occupied by size_estimates CF.
> > Out of 296GiB, 288GiB shows up as consumed by size_estimates in 'du -sh' 
> > output.
> >
> > This is while the other node is chugging along - shows only 25MiB consumed 
> > by size_estimates (du -sh output).
> >
> > Any idea why this discrepancy?
> > Is it safe to remove the size_estimates sstables from the affected node and 
> > restart the service?
> >
> > Thanks,
> > Kunal

Re: cfhistograms InstanceNotFoundException EstimatePartitionSizeHistogram

2018-03-06 Thread Chris Lohfink
Make sure you're using the same version of nodetool as your version of 
Cassandra. That metric was renamed from EstimatedRowSize, so if you are using a 
nodetool made for a more recent version you would get this error, since 
EstimatePartitionSizeHistogram doesn't exist on the older Cassandra host.


Sent from my iPhone

> On Mar 6, 2018, at 3:29 AM, onmstester onmstester <> wrote:
> Running this command:
> nodetool cfhistograms keyspace1 table1
> throws this exception in production server:
> org.apache.cassandra.metrics:type=Table,keyspace=keyspace1,scope=table1,name=EstimatePartitionSizeHistogram
> But I have no problem in a test server with little data in it and the same data model.
> I'm using Cassandra 3.
> Sent using Zoho Mail

Re: system.size_estimates - safe to remove sstables?

2018-03-05 Thread Chris Lohfink
Any chance the space is used by snapshots? What files exist there that are 
taking up space?

> On Mar 5, 2018, at 1:02 AM, Kunal Gangakhedkar  
> wrote:
> Hi all,
> I have a 2-node cluster running cassandra 2.1.18.
> One of the nodes has run out of disk space and died - almost all of it shows 
> up as occupied by size_estimates CF.
> Out of 296GiB, 288GiB shows up as consumed by size_estimates in 'du -sh' 
> output.
> This is while the other node is chugging along - shows only 25MiB consumed by 
> size_estimates (du -sh output).
> Any idea why this discrepancy?
> Is it safe to remove the size_estimates sstables from the affected node and 
> restart the service?
> Thanks,
> Kunal


Re: system.size_estimates - safe to remove sstables?

2018-03-05 Thread Chris Lohfink
Unless you are using spark or hadoop, nothing consumes the data in that table 
(unless you have tooling that may use it, like opscenter or something), so 
you're safe to just truncate it or rm the sstables while the instance is 
offline. If you do use that table, you can then run `nodetool 
refreshsizeestimates` to re-add it, or just wait for it to re-run automatically 
(every 5 min).


> On Mar 5, 2018, at 1:02 AM, Kunal Gangakhedkar <> 
> wrote:
> Hi all,
> I have a 2-node cluster running cassandra 2.1.18.
> One of the nodes has run out of disk space and died - almost all of it shows 
> up as occupied by size_estimates CF.
> Out of 296GiB, 288GiB shows up as consumed by size_estimates in 'du -sh' 
> output.
> This is while the other node is chugging along - shows only 25MiB consumed by 
> size_estimates (du -sh output).
> Any idea why this discrepancy?
> Is it safe to remove the size_estimates sstables from the affected node and 
> restart the service?
> Thanks,
> Kunal


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Chris Lohfink
Instead of saying "Make X better" you can quantify "Here's how we can make X 
better" in a jira and the conversation will continue with interested parties 
(opening jiras is free!). Being combative and insulting the project on the 
mailing list may help vent some frustrations, but it is counterproductive and 
makes people defensive.

People are not averse to usability, quite the opposite actually. People do tend 
to be averse to conversations opened up with "cassandra is an idiot" with no 
clear definition of how to make it better or what a better solution would look 
like though. Note however that saying "make backups better" or "look at 
marketing literature for these guys" is hard for an engineer or architect to 
break into actionable items. Coming up with cool ideas on how to do something 
will more likely hook a developer into working on it than trying to shame the 
community with a sales pitch from another DB's sales guy.


> On Feb 21, 2018, at 4:53 PM, Kenneth Brotman <> 
> wrote:
> Hi Akash,
> I get the part about outside work which is why in replying to Jeff Jirsa I 
> was suggesting the big companies could justify taking it on easy enough and 
> you know actually pay the people who would be working at it so those people 
> could have a life.
> The part I don't get is the aversion to usability.  Isn't that what you think 
> about when you are coding?  "Am I making this thing I'm building easy to 
> use?"  If you were programming for me, we would be constantly talking about 
> what we are building and how we can make things easier for users.  If I had 
> to fight with a developer, architect or engineer about usability all the 
> time, they would be gone and quick.  How do approach programming if you 
> aren't trying to make things easy.
> Kenneth Brotman
> -Original Message-
> From: Akash Gangil [] 
> Sent: Wednesday, February 21, 2018 2:24 PM
> To:
> Cc:
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
> I would second Jon in the arguments he made. Contributing outside work is 
> draining and really requires a lot of commitment. If someone requires 
> features around usability etc, just pay for it, period.
> On Wed, Feb 21, 2018 at 2:20 PM, Kenneth Brotman < 
>> wrote:
>> Jon,
>> Very sorry that you don't see the value of the time I'm taking for this.
>> I don't have demands; I do have a stern warning and I'm right Jon.  
>> Please be very careful not to mischaracterized my words Jon.
>> You suggest I put things in JIRA's, then seem to suggest that I'd be 
>> lucky if anyone looked at it and did anything. That's what I figured too.
>> I don't appreciate the hostility.  You will understand more fully in 
>> the next post where I'm coming from.  Try to keep the conversation civilized.
>> I'm trying or at least so you understand I think what I'm doing is 
>> saving your gig and mine.  I really like a lot of people is this group.
>> I've come to a preliminary assessment on things.  Soon the cloud will 
>> clear or I'll be gone.  Don't worry.  I'm a very peaceful person and 
>> like you I am driven by real important projects that I feel compelled 
>> to work on for the good of others.  I don't have time for people to 
>> hand hold a database and I can't get stuck with my projects on the wrong 
>> stuff.
>> Kenneth Brotman
>> -Original Message-
>> From: Jon Haddad [] On Behalf Of Jon 
>> Haddad
>> Sent: Wednesday, February 21, 2018 12:44 PM
>> To:
>> Cc:
>> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>> Ken,
>> Maybe it’s not clear how open source projects work, so let me try to 
>> explain.  There’s a bunch of us who either get paid by someone or 
>> volunteer on our free time.  The folks that get paid, (yay!) usually 
>> take direction on what the priorities are, and work on projects that 
>> directly affect our jobs.  That means that someone needs to care 
>> enough about the features you want to work on them, if you’re not going to 
>> do it yourself.
>> Now as others have said already, please put your list of demands in 
>> JIRA, if someone is interested, they will work on it.  You may need to 
>> contribute a little more than you’ve done already, be prepared to get 
>> involved if you actually want to to see something get done.  Perhap

Re: Commitlogs are filling the Full Disk space and nodes are down

2018-01-30 Thread Chris Lohfink
The commitlog growing is often a symptom of a problem. If the memtable flush or 
post flush fails in any way, the commitlogs will not be recycled/deleted and 
will continue to pile up.

Might want to go back in the logs earlier to make sure there's nothing like the 
postmemtable flusher getting a permission error (some tooling creates 
commitlogs, so if run by the wrong user it can create this problem), or a 
memtable flush error. You can also check tpstats to see if tasks are queued up 
in the postmemtable flusher, and jstack to see where the active ones are stuck 
if they are.


> On Jan 30, 2018, at 4:20 AM, Amit Singh <> wrote:
> Hi,
> When you actually say nodetool flush, data from memTable goes to disk based 
> structure as SStables and side by side , commit logs segments for that 
> particular data get written off and its continuous process . May be in your 
> case , you can decrease the value of  below uncommented property in 
> Cassandra.yaml 
> commitlog_total_space_in_mb
> Also this is what is it used for 
> # Total space to use for commit logs on disk.
> #
> # If space gets above this value, Cassandra will flush every dirty CF
> # in the oldest segment and remove it.  So a small total commitlog space
> # will tend to cause more flush activity on less-active columnfamilies.
> #
> # The default value is the smaller of 8192, and 1/4 of the total space
> # of the commitlog volume.
> From: Mokkapati, Bhargav (Nokia - IN/Chennai) 
> [] 
> Sent: Tuesday, January 30, 2018 4:00 PM
> To:
> Subject: Commitlogs are filling the Full Disk space and nodes are down
> Hi Team,
> My Cassandra version : Apache Cassandra 3.0.13
> Cassandra nodes are down due to Commitlogs are getting filled up until full 
> disk size.
> With “Nodetool flush” I didn’t see any commitlogs deleted.
> Can anyone tell me how to flush the commitlogs without losing data.
> Thanks,
> Bhargav M

Re: sstabledump tries to delete a file

2018-01-10 Thread Chris Lohfink
Yes, it should be read only; open a jira please. It does look like it would
rebuild the summary if the fp chance changed or if it's missing. When it builds
the table metadata from the sstable it can just set the properties to match
those of the sstable to prevent this.


On Wed, Jan 10, 2018 at 4:16 AM, Python_Max <> wrote:

> Hello all.
> I have an error when trying to dump SSTable (Cassandra 3.11.1):
> $ sstabledump mc-56801-big-Data.db
> Exception in thread "main" FSWriteError in /var/lib/cassandra/data/<
> keyspace>//mc-56801-big-Summary.db
> at
> at
> at
> saveSummary(
> at
> saveSummary(
> at
> load(
> at
> load(
> at
> open(
> at
> openNoValidation(
> at
> Caused by: java.nio.file.AccessDeniedException: /var/lib/cassandra/data/<
> keyspace>//mc-56801-big-Summary.db
> at sun.nio.fs.UnixException.translateToIOException(
> at sun.nio.fs.UnixException.rethrowAsIOException(
> at sun.nio.fs.UnixException.rethrowAsIOException(
> at sun.nio.fs.UnixFileSystemProvider.implDelete(
> at sun.nio.fs.AbstractFileSystemProvider.delete(
> at java.nio.file.Files.delete(
> at
> ... 8 more
> Seems that sstabledump tries to delete and recreate summary file which I
> think is risky because external modification to files that should be
> modified only by Cassandra itself can lead to unpredictable behavior.
> When I copy all related files, change their owner to myself and run
> sstabledump in that directory, the Summary.db file is recreated but its
> md5 is exactly the same as the original Summary.db file's.
> I indeed have changed bloom_filter_fp_chance couple months ago, so I
> believe that's the reason why SSTableReader wants to recreate summary file.
> After nodetool scrub an error still happens.
> I have not found any issues like this in bug tracker.
> Shouldn't sstabledump be read only?
> --
> Best regards,
> Python_Max.

Re: sstable

2017-12-20 Thread Chris Lohfink
Somewhere along the line the sstabledump tool incorrectly got set up to use
tool initialization; it's fixed


On Tue, Dec 19, 2017 at 5:45 PM, Mounika kale <>

> Hi,
>   I'm getting below error for all sstable tools.
> sstabledump mc-173-big-Data.db
> Exception in thread "main" java.lang.ExceptionInInitializerError
> Caused by: org.apache.cassandra.exceptions.ConfigurationException:
> Expecting URI in variable: [cassandra.config]. Found[cassandra.yaml].
> Please prefix the file with [file:///] for local files and
> [file://<server>/] for remote files. If you are executing this from an
> external tool, it needs to set Config.setClientMode(true) to avoid loading
> configuration.
> at org.apache.cassandra.config.YamlConfigurationLoader.
> getStorageConfigURL(
> at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(
> at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(
> at org.apache.cassandra.config.DatabaseDescriptor.toolInitialization(
> at org.apache.cassandra.config.DatabaseDescriptor.toolInitialization(
> at

Re: gc causes C* node hang

2017-11-30 Thread Chris Lohfink
Your mail client may be changing the character if you're copying and pasting;
it's the "-" hyphen, not the unicode en dash –. I would recommend adding it
to the jvm options like Oleksandr pointed out.
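
A quick way to check a pasted option for this problem is sketched below. It is only an illustration: the helper names are made up, and the list of look-alike characters is an assumption about what mail clients tend to substitute.

```python
# Hypothetical helpers (not part of Cassandra or the JVM): detect and repair
# Unicode dashes that were substituted for the ASCII hyphen in a JVM option.
ASCII_HYPHEN = "-"

# Assumed set of look-alike characters; extend as needed.
LOOKALIKE_DASHES = {
    "\u2010": "HYPHEN",
    "\u2011": "NON-BREAKING HYPHEN",
    "\u2013": "EN DASH",
    "\u2212": "MINUS SIGN",
}


def find_bad_dashes(option):
    """Return (index, character name) for every look-alike dash found."""
    return [
        (index, LOOKALIKE_DASHES[char])
        for index, char in enumerate(option)
        if char in LOOKALIKE_DASHES
    ]


def fix_dashes(option):
    """Replace every look-alike dash with the ASCII hyphen."""
    return "".join(
        ASCII_HYPHEN if char in LOOKALIKE_DASHES else char for char in option
    )


pasted = "\u2013XX:PrintSafepointStatisticsCount=1"  # starts with an en dash
print(find_bad_dashes(pasted))
print(fix_dashes(pasted))
```

If find_bad_dashes returns anything, the JVM will treat the option as a main class name, which matches the "Could not find or load main class" error quoted below.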


On Thu, Nov 30, 2017 at 1:50 AM, Oleksandr Shulgin <> wrote:

> On Thu, Nov 30, 2017 at 1:38 AM, Peng Xiao <> wrote:
>> looks we are not able to enable –XX:PrintSafepointStatisticsCount=1
>> in
>> Could anyone please advise?
>> ...
>> Error: Could not find or load main class –XX:PrintSafepointStatisticsCount=1
> Hm, not sure how are you doing it, but it boils down to adding a line
> somewhere in the like this one:
> JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
> OR, if you're using a newer version (3.0 or newer), the following in the
> jvm.options file:
> -XX:PrintSafepointStatisticsCount=1
> Cheers,
> --
> Alex

Re: What is OneMinuteRate in Write Latency?

2017-11-03 Thread Chris Lohfink
It's from the metrics library's Meter
object, which tracks the exponentially weighted moving average
of the rate of events (so here it is writes per second, not a latency).
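
To make the decay behaviour concrete, here is a minimal sketch of such a moving average. It assumes a 5 second tick and a one minute window, which is how the metrics library's one-minute rate is generally described; the class and function names are illustrative and this is not the library's actual code.

```python
import math

TICK_SECONDS = 5.0  # assumed interval between rate updates


def ewma_alpha(window_minutes, tick_seconds=TICK_SECONDS):
    """Smoothing factor for a moving average over the given window."""
    return 1.0 - math.exp(-tick_seconds / 60.0 / window_minutes)


class Ewma:
    """Exponentially weighted moving average of an event rate (events/sec)."""

    def __init__(self, alpha, tick_seconds=TICK_SECONDS):
        self.alpha = alpha
        self.tick_seconds = tick_seconds
        self.rate = 0.0
        self.uncounted = 0
        self.initialized = False

    def mark(self, events=1):
        """Record events seen since the last tick."""
        self.uncounted += events

    def tick(self):
        """Fold the events of the last interval into the running rate."""
        instant_rate = self.uncounted / self.tick_seconds
        self.uncounted = 0
        if self.initialized:
            self.rate += self.alpha * (instant_rate - self.rate)
        else:
            self.rate = instant_rate
            self.initialized = True


one_minute = Ewma(ewma_alpha(1))
one_minute.mark()        # a single write
for _ in range(3):       # then three idle intervals
    one_minute.tick()
    print(one_minute.rate)
```

A single write followed by idle time gives a small, steadily decaying rate, which is why a fresh cluster with Count = 1 can report a OneMinuteRate well below 1.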


On Thu, Nov 2, 2017 at 12:10 PM, AI Rumman <> wrote:

> Hi,
> I am trying to calculate the Read/second and Write/Second in my Cassandra
> 2.1 cluster. After searching and reading, I came to know about JMX bean
> "org.apache.cassandra.metrics:type=ClientRequest,scope=
> Write,name=Latency".
> Here I can see oneMinuteRate. I have started a brand new cluster and
> started collected these metrics from 0.
> When I started my first record, I can see
> Count = 1
>> OneMinuteRate = 0.01599111...
> Does it mean that my write/s is 0.0159911? Or does it mean that based on 1
> minute data, my write latency is 0.01599 where Write Latency refers to the
> response time for writing a record?
> Please help me understand the value.
> Thanks.

Re: Cassandra Compaction Metrics - CompletedTasks vs TotalCompactionCompleted

2017-10-31 Thread Chris Lohfink
CompactionMetrics is a combination of the compaction executor (sstable
compactions, secondary index build, view building, relocate,
garbagecollect, cleanup, scrub etc) and validation executor (repairs). Keep
in mind not all jobs execute 1 task per operation, things that use the
parallelAllSSTableOperation like cleanup will create 1 task per sstable.

The "CompletedTasks" metric is a measure of how many tasks ran on these two
executors combined.
The "TotalCompactionsCompleted" metric is a measure of how many compactions
issued from the compaction manager ran (normal compactions, cache writes,
scrub, 2i and MVs).  So while they may be close, depending on what's
happening on the system, there's no assurance that they will be within any
bounds of each other.

So I would suspect validation compactions from repairs would be one major
difference. If you run other operational tasks there will likely be more.

On Mon, Oct 30, 2017 at 12:22 PM, Lucas Benevides <> wrote:

> Kurt,
> I appreciate your answer but I don't believe CompletedTasks count the
> "validation compactions". These are compactions that occur from repair
> operations. I am running tests on 10 cluster nodes in the same physical
> rack, with Cassandra Stress Tool and I didn't make any Repair commands. The
> tables only last for seven hours, so it is not reasonable that tens of
> thousands of these validation compactions occur per node.
> I tried to see the code and the CompletedTasks counter seems to be
> populated by a method from the class java.util.concurrent.
> ThreadPoolExecutor.
> So I really don't know what it is but surely is not the amount of
> Compaction Completed Tasks.
> Thank you
> Lucas Benevides
> 2017-10-30 8:05 GMT-02:00 kurt greaves :
>> I believe (may be wrong) that CompletedTasks counts Validation
>> compactions while TotalCompactionsCompleted does not. Considering a lot of
>> validation compactions can be created every repair it might explain the
>> difference. I'm not sure why they are named that way or work the way they
>> do. There appears to be no documentation around this in the code (what a
>> surprise) and looks like it was last touched in CASSANDRA-4009, which also
>> has no useful info.
>> On 27 October 2017 at 13:48, Lucas Benevides wrote:
>>> Dear community,
>>> I am studying the behaviour of the Cassandra
>>> TimeWindowCompactionStrategy. To do so I am watching some metrics. Two of
>>> these metrics are important: Compaction.CompletedTasks, a gauge, and the
>>> TotalCompactionsCompleted, a Meter.
>>> According to the documentation (
>>> oc/latest/operating/metrics.html#table-metrics):
>>> Completed Tasks = Number of completed compactions since server [re]start.
>>> TotalCompactionsCompleted = Throughput of completed compactions since
>>> server [re]start.
>>> As I realized, the TotalCompactionsCompleted, in the Meter object, has a
>>> counter, which I supposed would be numerically close to the CompletedTasks
>>> gauge. But they are very different, with the Completed Tasks being much
>>> higher than the TotalCompactions Completed.
>>> According to the code, in github (class
>>> Completed Tasks - Number of completed compactions since server [re]start
>>> TotalCompactionsCompleted - Total number of compactions since server
>>> [re]start
>>> Can you help me and explain the difference between these two metrics, as
>>> they seem to have very distinct values, with the Completed Tasks being
>>> around 1000 times the value of the counter in TotalCompactionsCompleted.
>>> Thanks in Advance,
>>> Lucas Benevides

Re: Inter Data Center Latency calculation of a Multi DC cluster running in AWS

2017-10-17 Thread Chris Lohfink
An alternative: if using >3.8, you can use the
org.apache.cassandra.metrics:type=Messaging,name=[DC]-Latency mbean, where
[DC] is the name of the DC, to get the inter DC latency per node
(to that node). This does not account for NTP drift though, just how long
messages (ie mutations) take to get to a node from other DCs.
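
As a small illustration, the mbean names for a set of DCs could be built like this. The DC names are hypothetical and the helper functions are not part of any Cassandra tooling; only the name pattern comes from the text above.

```python
def dc_latency_mbean(dc_name):
    """Build the Messaging mbean name for messages arriving from one DC."""
    return f"org.apache.cassandra.metrics:type=Messaging,name={dc_name}-Latency"


def dc_latency_mbeans(dc_names):
    """Map each DC name to its latency mbean name."""
    return {dc: dc_latency_mbean(dc) for dc in dc_names}


# Hypothetical DC names; use the names shown by nodetool status.
for dc, mbean in dc_latency_mbeans(["east", "west"]).items():
    print(dc, mbean)
```

Each node exposes these mbeans itself, so they would need to be read from every node of interest.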


On Tue, Oct 17, 2017 at 7:18 PM, Jon Haddad <> wrote:

> I recommend figuring out the latency between your datacenters.  Cassandra
> isn’t going to be any more than that barring JVM pauses on the remote
> coordinator.
> On Oct 17, 2017, at 4:17 PM, Bill Walters <> wrote:
> Hi Everyone,
> I need some suggestions on finding the time taken for Cassandra
> replication to happen from east to west region for write and read
> operations on a multi DC cluster.
> Currently, below is our cluster setup.
> *Cassandra version:* DSE 5.0.7
> *No of Data centers:* 2 (AWS East and AWS West regions)
> *No of Nodes:* 12 nodes (6 nodes in AWS East and 6 nodes in AWS West)
> *Replication Factor:* 3 in each data center.
> *Cluster size*: Around 40 GB on each node
> Sometime, next year we have an activity where our clients are going to be
> reading only from AWS West region. The data center in AWS east will be
> available but we do not want any reads to be done on this.(Our management
> wants to know the time it takes for Cassandra to replicate from one DC to
> the other)
> Here are some options I have thought of in finding the time taken for
> Cassandra replication to happen from AWS East DC to AWS West DC.
> 1. Setup a Java client to write/read a transaction with *"Local Quorum"*
> consistency level in *AWS East* region as Local data center, capture the
> time taken for this activity. Similarly use this client to perform read/write
> transaction with *"Local Quorum"* consistency level in *AWS West* region
> and capture the time. Then finally perform the same transaction with
> *"Each Quorum"* consistency level and capture the time.
> *Inter DC latency* = *Time taken for Each Quorum transaction* - *(Time
> taken for Local Quorum transaction in AWS East as local dc)* - *(Time
> taken for Local Quorum transaction in AWS West as local dc)*.
> 2. Utilize the replication-latency-tools open source project, where a Python
> Cassandra client writes in one data center and another client reads in the
> other data center.
> Can you please suggest if my strategies above will help in finding the
> Inter DC latency or there are other ways I need to follow.
> Thank You,
> Bill Walters.

Re: Cassandra and G1 Garbage collector stop the world event (STW)

2017-10-09 Thread Chris Lohfink
Can you share your schema and cfstats? This sounds kinda like a wide
partition, backed up compactions, or tombstone issue for it to create so
much and have issues like that so quickly with those settings.

A heap dump would be most telling but they are rather large and hard to


On Mon, Oct 9, 2017 at 8:12 AM, Gustavo Scudeler <>

> Hello,
> @kurt greaves: Have you tried CMS with that sized heap?
> Yes, for testing purposes, I have 3 nodes with CMS and 3 with
> G1. The behavior is basically the same.
> *Using CMS suggested settings*
> c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=
> *Using G1 suggested settings*
> c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3
> @Steinmaurer, Thomas If this happens in a very short very frequently and
>> depending on your allocation rate in MB/s, a combination of the G1 bug and
>> a small heap, might result going towards OOM.
> We have a really high obj allocation rate:
> Avg creation rate  622.9 mb/sec
> Avg promotion rate  18.39 mb/sec
> It could be the cause, where the GC can't keep up with this rate.
> I'm starting to think it could be some wrong configuration where Cassandra is
> configured in a way that bursts allocations in a manner that G1 can't keep
> up with.
> Any ideas?
> Best regards,
> 2017-10-09 12:44 GMT+01:00 Steinmaurer, Thomas <
>> Hi,
>> although not happening here with Cassandra (due to using CMS), we had
>> some weird problem with our server application e.g. hit by the following
>> JVM/G1 bugs:
>> (more or less  a
>> duplicate of above)
>> Especially the first, JDK-8140597, might be interesting, if you see
>> periodic humongous allocations (according to a GC log) resulting in mixed
>> GC phases being steadily interrupted due to G1 bug, thus no GC in OLD
>> regions. Humongous allocations will happen if a single (?) allocation is >
>> (region size / 2), if I remember correctly. Can’t recall the default G1
>> region size for a 12GB heap, but possibly 4MB. So, in case you are
>> allocating something larger than > 2MB, you might end up in something
>> called “humongous” allocations, spanning several G1 regions. If this
>> happens in a very short very frequently and depending on your allocation
>> rate in MB/s, a combination of the G1 bug and a small heap, might result
>> going towards OOM.
>> Possibly worth a further route for investigation.
>> Regards,
>> Thomas
>> *From:* Gustavo Scudeler []
>> *Sent:* Montag, 09. Oktober 2017 13:12
>> *To:*
>> *Subject:* Cassandra and G1 Garbage collector stop the world event (STW)
>> Hi guys,
>> We have a 6 node Cassandra Cluster under heavy utilization. We have been
>> dealing a lot with garbage collector stop the world event, which can take
>> up to 50 seconds in our nodes, in the meantime Cassandra Node is
>> unresponsive, not even accepting new logins.
>> Extra details:
>> · Cassandra Version: 3.11
>> · Heap Size = 12 GB
>> · We are using G1 Garbage Collector with default settings
>> · Nodes size: 4 CPUs 28 GB RAM
>> · All CPU cores are at 100% all the time.
>> · The G1 GC behavior is the same across all nodes.
>> The behavior remains basically:
>> 1.  Old Gen starts to fill up.
>> 2.  GC can't clean it properly without a full GC and a STW event.
>> 3.  The full GC starts to take longer, until the node is completely
>> unresponsive.
>> *Extra details and GC reports:*
>> g1-garbage-collector-stop-the-world-event-stw
>> Can someone point me what configurations or events I could check?
>> Thanks!
>> Best regards,
>> The contents of this e-mail are intended for the named addressee only. It
>> contains information that may be confidential. Unless you are the named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose it to anyone else. If you received it in error please notify us
>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered office
>> is at 4040 Linz, Austria, Freistädterstraße 313

Re: [EXTERNAL] Re: Increasing VNodes

2017-10-04 Thread Chris Lohfink
Can't you just increase the segmentCount option to split it more?

On Wed, Oct 4, 2017 at 12:50 PM, Mohapatra, Kishore <> wrote:

> Thanks a lot for all of your input. We are actually using Cassandra
> reaper. But it is just splitting the ranges into 256 per node.
> But I will certainly try out splitting into smaller ranges going thru the
> system.size_estimates table.
> Thanks
> *Kishore Mohapatra*
> Principal Operations DBA
> Seattle, WA
> Email :
> *From:* Jon Haddad [] * On Behalf Of *Jon
> Haddad
> *Sent:* Wednesday, October 04, 2017 10:27 AM
> *To:* user <>
> *Subject:* [EXTERNAL] Re: Increasing VNodes
> The site (with the docs) is probably more helpful to learn about how
> reaper works:
> On Oct 4, 2017, at 9:54 AM, Chris Lohfink <> wrote:
> Increasing number of tokens will make repairs worse not better. You can
> just split the sub ranges into smaller chunks, you dont need to use vnodes
> to do that. Simple approach is to iterate through each host token range and
> split by N and repair them (ie cassandra_range_repair).
> To be more efficient you can grab ranges and split based on number of
> partitions in the range (ie fetch system.size_estimates and walk that) so
> you dont split empty or small ranges a ton unnecessarily, and because not
> all tables have some fixed N that is efficient.
> Using TLP's reaper or DataStax OpsCenter's repair service is the easiest
> solution without a lot of effort. Repairs are hard.
> Chris
> On Wed, Oct 4, 2017 at 11:48 AM, Jeff Jirsa <> wrote:
> You don't need to change the number of vnodes, you can manually select
> CONTAINED token subranges and pass in -st and -et (just try to pick a
> number > 2^20 that is fully contained by at least one vnode).
> On Wed, Oct 4, 2017 at 9:46 AM, Mohapatra, Kishore <
>> wrote:
> Hi,
> We are having a lot of problems in repair process. We use sub
> range repair. But most of the time, some ranges fails with streaming error
> or some other kind of error.
> So wondering if it will help if we increase the no. of VNodes from 256
> (default) to 512. But increasing the VNodes will be a lot of efforts, as it
> involves wiping out the data and bootstrapping.
> So is there any other way of splitting the range into small ranges ?
> We are using version at the moment.
> Thanks
> *Kishore Mohapatra*
> Principal Operations DBA
> Seattle, WA
> Email :

Re: Increasing VNodes

2017-10-04 Thread Chris Lohfink
Increasing the number of tokens will make repairs worse, not better. You can
just split the sub ranges into smaller chunks; you don't need to use vnodes
to do that. A simple approach is to iterate through each host token range,
split it by N and repair the pieces (ie cassandra_range_repair). To be more
efficient you can grab ranges and split based on the number of partitions in
the range (ie fetch system.size_estimates and walk that) so you don't split
empty or small ranges a ton unnecessarily, and because not all tables have
some fixed N that is efficient.
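
The simple "split by N" approach can be sketched as below. This is an illustration only: it assumes a non-wrapping token range (start < end), the function names and the keyspace name are made up, and the generated commands rely on the -st and -et options mentioned later in this thread.

```python
def split_range(start, end, pieces):
    """Split the token range (start, end] into contiguous sub-ranges.

    Assumes a non-wrapping range; a range that wraps around the ring
    would have to be cut at the ring boundary first.
    """
    if start >= end:
        raise ValueError("expected a non-wrapping range with start < end")
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    width = end - start
    pieces = min(pieces, width)  # never create empty sub-ranges
    bounds = [start + (width * i) // pieces for i in range(pieces + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_commands(keyspace, start, end, pieces):
    """Build one sub-range repair command per piece (keyspace is hypothetical)."""
    return [
        f"nodetool repair -st {sub_start} -et {sub_end} {keyspace}"
        for sub_start, sub_end in split_range(start, end, pieces)
    ]


for command in repair_commands("my_ks", 0, 1000, 4):
    print(command)
```

Splitting by partition count instead would replace the fixed number of pieces with a value derived from system.size_estimates for each range.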

Using TLP's reaper or
DataStax OpsCenter's repair service is the easiest solution without a lot of
effort. Repairs are hard.


On Wed, Oct 4, 2017 at 11:48 AM, Jeff Jirsa <> wrote:

> You don't need to change the number of vnodes, you can manually select
> CONTAINED token subranges and pass in -st and -et (just try to pick a
> number > 2^20 that is fully contained by at least one vnode).
> On Wed, Oct 4, 2017 at 9:46 AM, Mohapatra, Kishore <
>> wrote:
>> Hi,
>> We are having a lot of problems in repair process. We use sub
>> range repair. But most of the time, some ranges fails with streaming error
>> or some other kind of error.
>> So wondering if it will help if we increase the no. of VNodes from 256
>> (default) to 512. But increasing the VNodes will be a lot of efforts, as it
>> involves wiping out the data and bootstrapping.
>> So is there any other way of splitting the range into small ranges ?
>> We are using version at the moment.
>> Thanks
>> *Kishore Mohapatra*
>> Principal Operations DBA
>> Seattle, WA
>> Email :

Re: Read-/ Write Latency - Cassandra 2.1 .15 vs 3.10

2017-10-03 Thread Chris Lohfink
The RecentReadLatency metrics have been deprecated for years (since 1.1 or 1.2)
and were removed in 2.2. It was a very misleading metric. Instead, pull the
Table's ReadLatency metrics from the org.apache.cassandra.metrics domain.
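
For a script like the one quoted below, the replacement query might be built along these lines. This is a sketch under assumptions: it reuses the Jolokia base URL from the quoted curl commands, the keyspace and table names are hypothetical, and the helper functions are not part of any existing tool.

```python
def table_metric_mbean(keyspace, table, metric):
    """Build the table-level metric mbean name in the metrics domain."""
    return (
        "org.apache.cassandra.metrics:type=Table,"
        f"keyspace={keyspace},scope={table},name={metric}"
    )


def jolokia_read_url(mbean, attributes, base="http://localhost:8778/jolokia"):
    """Build a Jolokia read URL for the given mbean attributes."""
    return f"{base}/read/{mbean}/{','.join(attributes)}"


# Hypothetical keyspace and table names.
mbean = table_metric_mbean("my_ks", "my_table", "ReadLatency")
print(jolokia_read_url(mbean, ["Count", "OneMinuteRate"]))
```

The same helper works for WriteLatency by changing the metric name.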


> On Oct 3, 2017, at 10:06 AM, Anumod Mullachery <> 
> wrote:
> Hi,
> We were running Splunk queries to pull read / write latency. It's working
> fine in 2.1.15, but not returning results from the upgraded version 3.10.
> The bean used in the script is as shown below. Let me know if there are any
> changes in functionality between 2.1.15 and 3.10, or if it was replaced by
> some other bean.
> perf_queries= { "org.apache.cassandra.db:type=StorageProxy" =>
> "RecentReadLatencyMicros,RecentWriteLatencyMicros", }
> stage_queries= { "org.apache.cassandra.request:type=*" =>
> "ActiveCount,PendingTasks,CurrentlyBlockedTasks", }
> curl http://localhost:8778/jolokia/read/org.apache.cassandra.db:type=StorageProxy/RecentReadLatencyMicros,RecentWriteLatencyMicros
> curl http://localhost:8778/jolokia/read/org.apache.cassandra.request:type=*/ActiveCount,PendingTasks,CurrentlyBlockedTasks
> ~ Thanks ~  Anumod

Re: Do not use Cassandra 3.11.0+ or Cassandra 3.0.12+

2017-09-12 Thread Chris Lohfink
Last I've seen of it, OpsCenter does not collect this metric. I don't think any
monitoring tools do.


> On Sep 11, 2017, at 4:06 PM, CPC <> wrote:
> Hi,
> Is this bug fixed in dse 5.1.3? As I understand it, calling jmx getTombStoneRatio
> triggers that bug. We are using opscenter as well; do you have any idea
> whether opscenter is using/calling this method?
> Thanks
> On Aug 29, 2017 6:35 AM, "Jeff Jirsa" <> wrote:
>> I shouldn't actually say I don't think it can happen on 3.0 - I haven't
>> seen this happen on 3.0 without some other code change to enable it, but
>> like I said, we're still investigating.
>> --
>> Jeff Jirsa
>>> On Aug 28, 2017, at 8:30 PM, Jeff Jirsa <> wrote:
>>> For what it's worth, I don't think this impacts 3.0 without adding some
>> other code change (the reporter of the bug on 3.0 had added custom metrics
>> that exposed a concurrency issue).
>>> We're looking at it on 3.11. I think 13038 made it far more likely to
>> occur, but I think it could have happened pre-13038 as well (would take
>> some serious luck with your deletion time distribution though - the
>> rounding in 13038 does make it more likely, but the race was already there).
>>> --
>>> Jeff Jirsa
>>>> On Aug 28, 2017, at 8:24 PM, Jay Zhuang <>
>> wrote:
>>>> We're using 3.0.12+ for a few months and haven't seen the issue like
>>>> that. Do we know what could trigger the problem? Or is 3.0.x really
>>>> impacted?
>>>> Thanks,
>>>> Jay
>>>>> On 8/28/17 6:02 AM, Hannu Kröger wrote:
>>>>> Hello,
>>>>> Current latest Cassandra version (3.11.0, possibly also 3.0.12+) has a
>> race
>>>>> condition that causes Cassandra to create broken sstables (stats file
>> in
>>>>> sstables to be precise).
>>>>> Bug described here:
>>>>> This change might be causing it (but not sure):
>>>>> Other related issues:
>>>>> I would not recommend using 3.11.0 nor upgrading to 3.0.12 or higher
>> before
>>>>> this is fixed.
>>>>> Cheers,
>>>>> Hannu
>>>> -
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:

Re: Cassandra CF Level Metrics (Read, Write Count and Latency)

2017-09-01 Thread Chris Lohfink
To be future compatible, you should consider using `type=Table` instead of
`type=ColumnFamily`, depending on your version.

> not matching with the total read requests

The table level metrics for Read/Write latencies will not match the number
of requests you've made. This metric is the amount of time it took to
perform the action of the read/write locally on that node. The
`type=ClientRequest` mbeans are the ones that are at the coordinator level,
including querying all the replicas and merging results etc.

The table metrics do have a name=CoordinatorReadLatency (also Scan for
range queries) mbean which may be what you're looking for. Table level write
coordinator metrics are missing since the read coordinator metrics were
actually added for speculative retry so I think writes were overlooked.


On Thu, Aug 31, 2017 at 10:58 PM, Jai Bheemsen Rao Dhanwada <> wrote:

> okay, let me try it out
> On Thu, Aug 31, 2017 at 8:30 PM, Christophe Schmitz <
>> wrote:
>> Hi Jai,
>> The ReadLatency MBean expose a few metrics, including the count one,
>> which is the total read requests you are after.
>> See attached screenshot
>> Cheers,
>> Christophe
>> On 1 September 2017 at 09:21, Jai Bheemsen Rao Dhanwada <
>>> wrote:
>>> I did look at the document and tried setting up the metric as follows,
>>> but this is not matching with the total read requests. I am using
>>> "ReadLatency_OneMinuteRate"
>>> /org.apache.cassandra.metrics:type=ColumnFamily,keyspace=*,scope=*,name=ReadLatency
>>> On Thu, Aug 31, 2017 at 4:17 PM, Christophe Schmitz <
>>>> wrote:
>>>> Hello Jai,
>>>> Did you have a look at the following page:
>>>> In your case, you would want the following MBeans:
>>>> org.apache.cassandra.metrics:type=Table keyspace=<Keyspace>
>>>> scope=<Table> name=<MetricName>
>>>> With MetricName set to ReadLatency and WriteLatency
>>>> Cheers,
>>>> Christophe
>>>> On 1 September 2017 at 09:08, Jai Bheemsen Rao Dhanwada <
>>>>> wrote:
>>>>> Hello All,
>>>>> I am looking to capture the CF level Read, Write count and Latency. As
>>>>> of now I am using Telegraf plugin to capture the JMX metrics.
>>>>> What is the MBeans, scope and metric to look for the CF level metrics?
>> --
>> *Christophe Schmitz*
>> *Director of consulting EMEA*AU: +61 4 03751980 / FR: +33 7 82022899
>> This email has been sent on behalf of Instaclustr Pty. Limited
>> (Australia) and Instaclustr Inc (USA).
>> This email and any attachments may contain confidential and legally
>> privileged information.  If you are not the intended recipient, do not copy
>> or disclose its content, but please reply to this email immediately and
>> highlight the error to the sender and then immediately delete the message.
Re: Cassandra - Nodes can't restart due to java.lang.OutOfMemoryError: Direct buffer memory

2017-08-31 Thread Chris Lohfink
What version of java are you running? There is a "kinda leak" in the jvm
around this that you may run into; you can try with
-Djdk.nio.maxCachedBufferSize=262144 if on 8u102 or later. You can also try
increasing the size allowed for direct byte buffers with
-XX:MaxDirectMemorySize=?G; it defaults to the size of the heap.

> Some NIO channel operations use temporary DirectByteBuffers which are
> cached in thread-local caches to avoid having to allocate / free a buffer
> at every operation.
> Unfortunately, there is no bound imposed on the size of buffers added to
> the thread-local caches. So, infrequent channel operations that require a
> very large buffer can create a native memory leak.

> *Ability to limit the capacity of buffers that can be held in the
> temporary buffer cache*
> The system property jdk.nio.maxCachedBufferSize has
> been introduced in 8u102 to limit the memory used by the "temporary buffer
> cache." The temporary buffer cache is a per-thread cache of direct memory
> used by the NIO implementation to support applications that do I/O with
> buffers backed by arrays in the java heap. The value of the property is the
> maximum capacity of a direct buffer that can be cached. If the property is
> not set, then no limit is put on the size of buffers that are cached.
> Applications with certain patterns of I/O usage may benefit from using this
> property. In particular, an application that does I/O with large
> multi-megabyte buffers at startup but does I/O with small buffers may see a
> benefit to using this property. Applications that do I/O using direct
> buffers will not see any benefit to using this system property.
> See JDK-8147468


On Thu, Aug 31, 2017 at 4:59 AM, Jonathan Baynes <> wrote:

> I wonder if it's related to this bug (below) that's currently unresolved,
> despite being reproduced way back in 2.1.11
> *From:* qf zhou []
> *Sent:* 31 August 2017 10:58
> *To:*
> *Subject:* Re: Cassandra - Nodes can't restart due to
> java.lang.OutOfMemoryError: Direct buffer memory
> I am using Cassandra 3.9 with cqlsh 5.0.1.
> On 31 August 2017, at 5:54 PM, Jonathan Baynes <> wrote:
> again
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient (or have received this e-mail in error)
> please notify the sender immediately and destroy it. Any unauthorized
> copying, disclosure or distribution of the material in this e-mail is
> strictly forbidden. Tradeweb reserves the right to monitor all e-mail
> communications through its networks. If you do not wish to receive
> marketing emails about our products / services, please let us know by
> contacting us, either by email at or by writing to
> us at the registered office of Tradeweb in the UK, which is: Tradeweb
> Europe Limited (company number 3912826), 1 Fore Street Avenue London EC2Y
> 9DT. To see our privacy policy, visit our website @

Re: Nodetool tablehistograms

2017-07-19 Thread Chris Lohfink
It's the number of sstables that may have been read from. This includes
sstables that had their bloom filters checked (which may hit disk). This
changes a bit in to
be only the sstables that it's actually reading from.

On Wed, Jul 19, 2017 at 11:04 AM, Abhinav Solan 

> Hi Everyone,
> Here is the result of my tablehistograms command on one of our tables.
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                            (micros)      (micros)         (bytes)
> 50%             4.00          73.46        545.79          152321        8239
> 75%            10.00          88.15       2346.80          379022       20501
> 95%            10.00         152.32       4055.27         1358102       73457
> 98%            10.00         219.34       4866.32         1955666       88148
> 99%            10.00         315.85       5839.59         1955666
> Min             0.00          17.09         35.43              73           3
> Max            10.00       36157.19      52066.35         2816159
> What does SSTables column represent here?
> Does it mean how many SSTables the read is spanning to?
> Thanks,
> Abhinav

Re: reduced num_token = improved performance ??

2017-07-12 Thread Chris Lohfink
Probably worth mentioning that some operational procedures like repairs,
bootstrapping etc are helped massively by using fewer tokens. Incremental
repairs are one of the things I would say are most impacted by it, since
fewer tokens will mean fewer local ranges to iterate through and less
anticompaction. I would highly recommend using far fewer than 256 in 3.x.


On Tue, Jul 11, 2017 at 8:36 PM, Justin Cameron <>

> Hi,
> Using fewer vnodes means you'll have a higher chance of hot spots in your
> cluster. Hot spots in Cassandra are nodes that, by random chance, are
> responsible for a higher percentage of the token space than others. This
> means they will receive more data and also more traffic/load than other
> nodes in the cluster.
> CASSANDRA-7032 goes a long way towards addressing this issue by allocating
> vnode tokens more intelligently, rather than just randomly assigning them.
> If you're using a version of Cassandra that contains this feature (3.0+),
> you can use a smaller number of vnodes in your cluster.
> A high number of vnodes won't affect performance for most Cassandra
> workloads, but if you're running tasks that need to do token-range scans
> (such as Spark), there is usually a significant performance hit.
> If you're on C* 3.0+ and are using Spark (or similar workloads - cassandra
> lucene index plugin is also affected) then I'd recommend using fewer vnodes
> - 16 would be ok. You'll probably still see some variance in token-space
> ownership between nodes, but the trade-off for better Spark performance
> will likely be worth it.
> Justin
> On Wed, 12 Jul 2017 at 00:34 ZAIDI, ASAD A <> wrote:
>> Hi Folks,
>> Pardon me if I’m missing  something obvious.  I’m still using
>> apache-cassandra 2.2 and planning for upgrade to  3.x.
>> I came across this jira [
>> jira/browse/CASSANDRA-7032] that suggests reducing num_token may improve
>> general performance of Cassandra like having  num_token=16 instead of 256
>>   may help!
>> Can you please suggests if having less num_token would provide real
>> performance benefits or if  it comes with any downsides that we should also
>> consider? I’ll much appreciate your insights.
>> Thank you
>> Asad
> --
> *Justin Cameron*Senior Software Engineer

Re: Understanding of cassandra metrics

2017-07-07 Thread Chris Lohfink
The coordinator read/scan metrics (Scan is just a different name for Range, so
CoordinatorScanLatency is the coordinator's view of RangeLatency) are latencies
from the coordinator's perspective, so they include network latency between
replicas and such. They were actually added for speculative retry (which is why
there is no coordinatorWriteLatency). Only CoordinatorReadLatency is used for it.

The Read/RangeLatency metrics are for local reads, basically just how long
it takes to read from disk and merge with sstables.

The View* metrics are only relevant to materialized views. There actually
is a partition lock for updates, which ViewLockAcquireTime gives visibility
into. Also, reads are sometimes required for updating materialized
views, which ViewReadTime tracks. For more details I'd recommend


On Fri, Jul 7, 2017 at 9:42 AM, ZAIDI, ASAD A <> wrote:

> What exactly does mean CoordinatorScanLatency for example
> CoordinatorScanLatency is a timer metric that presents the coordinator range
> scan latency for a table.
> Is it latency on full table scan or maybe range scan by clustering key?
> It is a range scan. The clustering key is only used to store
> data in sorted order - the partition key, along with the chosen partitioner,
> helps in range scans of data.
> Can anybody write into partition while locked?
> Writes are atomic - it depends on your chosen consistency
> level whether writes will fail or succeed.
> *From:* Павел Сапежко []
> *Sent:* Friday, July 07, 2017 8:23 AM
> *To:*
> *Subject:* Re: Understanding of cassandra metrics
> Do you really think that I don't read the docs? Is there enough
> information in the documentation? I think not. What exactly does
> CoordinatorScanLatency
> mean, for example? Is it the latency of a full table scan, or maybe of a
> range scan by clustering key? What exactly does ViewLockAcquireTime mean?
> What is a "partition lock"? Can anybody write into a partition while it is
> locked? Etc.
> On Fri, 7 Jul 2017 at 13:01, Ivan Iliev <> wrote:
> 1st result on google returns:
> <>
> On Fri, Jul 7, 2017 at 12:16 PM, Павел Сапежко <>
> wrote:
> Hello, I have several questions about Cassandra metrics. What exactly do
> the following metrics mean:
> - CoordinatorReadLatency
> - CoordinatorScanLatency
> - ReadLatency
> - RangeLatency
> - ViewLockAcquireTime
> - ViewReadTime
> --
> Best regards,
> Павел Сапежко
> skype: p.sapezhko

Re: Partition range incremental repairs

2017-06-19 Thread Chris Stokesmore
Does anyone have any more thoughts on this at all? I'm struggling to understand it.

> On 9 Jun 2017, at 11:32, Chris Stokesmore <> 
> wrote:
> Hi Anuj,
> Thanks for the reply.
> 1). We are using Cassandra 2.2.8, and our repair commands we are comparing 
> are 
> "nodetool repair --in-local-dc --partitioner-range” and 
> "nodetool repair --in-local-dc”
> Since 2.2 I believe inc repairs are the default - that seems to be confirmed 
> in the logs that list the repair details when a repair starts.
> 2) From looks at a few runsr, on average:
> with -pr repairs, each node is approx 6.5 - 8 hours, so a total over the 7 
> nodes of 53 hours
> With just inc repairs, each node ~26 - 29 hours, so a total of 193
> 3) we currently have two DCs in total, the ‘production’ ring with 7 nodes and 
> RF=3, and a testing ring with one single node and RF=1 for our single 
> keyspace we currently use.
> 4) Yeah that number came from the Cassandra repair logs from an inc repair, I 
> can share the number reports when using a pr repair later this evening when 
> the currently running repair has completed.
> Many thanks for the reply again,
> Chris
>> On 6 Jun 2017, at 17:50, Anuj Wadehra < 
>> <>> wrote:
>> Hi Chris,
>> Can your share following info:
>> 1. Exact repair commands you use for inc repair and pr repair
>> 2. Repair time should be measured at cluster level for inc repair. So, whats 
>> the total time it takes to run repair on all nodes for incremental vs pr 
>> repairs?
>> 3. You are repairing one dc DC3. How many DCs are there in total and whats 
>> the RF for keyspaces? Running pr on a specific dc would not repair entire 
>> data.
>> 4. 885 ranges? From where did you get this number? Logs? Can you share the 
>> number ranges printed in logs for both inc and pr case?
>> Thanks
>> Anuj
>> On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore
>> < <>> wrote:
>> Thank you for the excellent and clear description of the different versions 
>> of repair Anuj, that has cleared up what I expect to be happening.
>> The problem now is in our cluster, we are running repairs with options 
>> (parallelism: parallel, primary range: false, incremental: true, job 
>> threads: 1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 
>> 885) and when we do our repairs are taking over a day to complete when 
>> previously when running with the partition range option they were taking 
>> more like 8-9 hours.
>> As I understand it, using incremental should have sped this process up as 
>> all three sets of data on each repair job should be marked as repaired 
>> however this does not seem to be the case. Any ideas?
>> Chris
>>> On 6 Jun 2017, at 16:08, Anuj Wadehra < 
>>> <>> wrote:
>>> Hi Chris,
>>> Using pr with incremental repairs does not make sense. Primary range repair 
>>> is an optimization over full repair. If you run full repair on a n node 
>>> cluster with RF=3, you would be repairing each data thrice. 
>>> E.g. in a 5 node cluster with RF=3, a range may exist on node A,B and C . 
>>> When full repair is run on node A, the entire data in that range gets 
>>> synced with replicas on node B and C. Now, when you run full repair on 
>>> nodes B and C, you are wasting resources on repairing data which is already 
>>> repaired. 
>>> Primary range repair ensures that when you run repair on a node, it ONLY 
>>> repairs the data which is owned by the node. Thus, no node repairs data 
>>> which is not owned by it and must be repaired by other node. Redundant work 
>>> is eliminated. 
>>> Even in pr, each time you run pr on all nodes, you repair 100% of data. Why 
>>> to repair complete data in each cycle?? ..even data which has not even 
>>> changed since the last repair cycle?
>>> This is where Incremental repair comes as an improvement. Once repaired, a 
>>> data would be marked repaired so that the next repair cycle could just 
>>> focus on repairing the delta. Now, lets

Re: Partition range incremental repairs

2017-06-09 Thread Chris Stokesmore
> I can't recommend *anyone* use incremental repair as there's some pretty 
> horrible bugs in it that can cause Merkle trees to wildly mismatch & result 
> in massive overstreaming.  Check out 
> <>.  
> TL;DR: Do not use incremental repair before 4.0.

Hi Jonathan,

Thanks for your reply, this is a slightly scary message for us! 2.2 has been 
out for nearly 2 years and incremental repairs are the default - and it has 
horrible bugs!?
I guess massive over streaming while a performance issue, does not affect data 

Are there any plans to back port this to 3 or ideally 2.2 ?


> On Tue, Jun 6, 2017 at 9:54 AM Anuj Wadehra <> 
> wrote:
> Hi Chris,
> Can you share the following info:
> 1. The exact repair commands you use for inc repair and pr repair
> 2. Repair time should be measured at the cluster level for inc repair. So,
> what's the total time it takes to run repair on all nodes for incremental vs
> pr repairs?
> 3. You are repairing one DC, DC3. How many DCs are there in total, and what's
> the RF for the keyspaces? Running pr on a specific DC would not repair the
> entire data.
> 4. 885 ranges? Where did you get this number from? Logs? Can you share the
> number of ranges printed in the logs for both the inc and pr cases?
> Thanks
> Anuj
> On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore
> < <>> wrote:
> Thank you for the excellent and clear description of the different versions 
> of repair Anuj, that has cleared up what I expect to be happening.
> The problem now is in our cluster, we are running repairs with options 
> (parallelism: parallel, primary range: false, incremental: true, job threads: 
> 1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885) and 
> when we do our repairs are taking over a day to complete when previously when 
> running with the partition range option they were taking more like 8-9 hours.
> As I understand it, using incremental should have sped this process up as all 
> three sets of data on each repair job should be marked as repaired however 
> this does not seem to be the case. Any ideas?
> Chris
>> On 6 Jun 2017, at 16:08, Anuj Wadehra < 
>> <>> wrote:
>> Hi Chris,
>> Using pr with incremental repairs does not make sense. Primary range repair 
>> is an optimization over full repair. If you run full repair on a n node 
>> cluster with RF=3, you would be repairing each data thrice. 
>> E.g. in a 5 node cluster with RF=3, a range may exist on node A,B and C . 
>> When full repair is run on node A, the entire data in that range gets synced 
>> with replicas on node B and C. Now, when you run full repair on nodes B and 
>> C, you are wasting resources on repairing data which is already repaired. 
>> Primary range repair ensures that when you run repair on a node, it ONLY 
>> repairs the data which is owned by the node. Thus, no node repairs data 
>> which is not owned by it and must be repaired by other node. Redundant work 
>> is eliminated. 
>> Even in pr, each time you run pr on all nodes, you repair 100% of data. Why 
>> to repair complete data in each cycle?? ..even data which has not even 
>> changed since the last repair cycle?
>> This is where Incremental repair comes as an improvement. Once repaired, a 
>> data would be marked repaired so that the next repair cycle could just focus 
>> on repairing the delta. Now, lets go back to the example of 5 node cluster 
>> with RF =3.This time we run incremental repair on all nodes. When you repair 
>> entire data on node A, all 3 replicas are marked as repaired. Even if you 
>> run inc repair on all ranges on the second node, you would not re-repair the 
>> already repaired data. Thus, there is no advantage of repairing only the 
>> data owned by the node (primary range of the node). You can run inc repair 
>> on all the data present on a node and Cassandra would make sure that when 
>> you repair data on other nodes, you only repair unrepaired data.
>> Thanks
>> Anuj
>> On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore
>> <chris.elsm...@

Re: Partition range incremental repairs

2017-06-09 Thread Chris Stokesmore
Hi Anuj,

Thanks for the reply.

1) We are using Cassandra 2.2.8, and the repair commands we are comparing are
"nodetool repair --in-local-dc --partitioner-range" and
"nodetool repair --in-local-dc".
Since 2.2 I believe inc repairs are the default - that seems to be confirmed in
the logs that list the repair details when a repair starts.

2) From looking at a few runs, on average:
with -pr repairs, each node takes approx 6.5 - 8 hours, so a total over the 7
nodes of 53 hours
With just inc repairs, each node takes ~26 - 29 hours, so a total of 193 hours
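
As a quick sanity check of the totals above, the per-node ranges can be multiplied out over the 7 nodes. This is only a sketch of the arithmetic, assuming the repairs run one node after another; the variable and function names are made up for illustration.

```python
# Per-node repair durations reported above, in hours (low, high).
NODES = 7
pr_hours = (6.5, 8.0)     # with -pr
inc_hours = (26.0, 29.0)  # without -pr (incremental, all ranges on the node)


def cluster_total_bounds(per_node, nodes=NODES):
    """Return (low, high) total hours if every node runs one repair in turn."""
    low, high = per_node
    return nodes * low, nodes * high


pr_low, pr_high = cluster_total_bounds(pr_hours)
inc_low, inc_high = cluster_total_bounds(inc_hours)

print(f"-pr total:    {pr_low:.1f} to {pr_high:.1f} hours (reported: 53)")
print(f"non-pr total: {inc_low:.1f} to {inc_high:.1f} hours (reported: 193)")
print(f"reported slowdown: {193 / 53:.2f}x")
```

Both reported totals fall inside the computed bounds, and the slowdown of roughly 3.6x is in the same ballpark as the factor of 3 one would expect when each node repairs 3/7 of the data instead of 1/7.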

3) we currently have two DCs in total, the ‘production’ ring with 7 nodes and 
RF=3, and a testing ring with one single node and RF=1 for our single keyspace 
we currently use.

4) Yeah, that number came from the Cassandra repair logs from an inc repair. I
can share the number reported when using a pr repair later this evening, when
the currently running repair has completed.

Many thanks for the reply again,


> On 6 Jun 2017, at 17:50, Anuj Wadehra <> wrote:
> Hi Chris,
> Can you share the following info:
> 1. The exact repair commands you use for inc repair and pr repair
> 2. Repair time should be measured at the cluster level for inc repair. So,
> what's the total time it takes to run repair on all nodes for incremental vs
> pr repairs?
> 3. You are repairing one DC, DC3. How many DCs are there in total, and what's
> the RF for the keyspaces? Running pr on a specific DC would not repair the
> entire data.
> 4. 885 ranges? Where did you get this number from? Logs? Can you share the
> number of ranges printed in the logs for both the inc and pr cases?
> Thanks
> Anuj
> On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore
> <> wrote:
> Thank you for the excellent and clear description of the different versions 
> of repair Anuj, that has cleared up what I expect to be happening.
> The problem now is in our cluster, we are running repairs with options 
> (parallelism: parallel, primary range: false, incremental: true, job threads: 
> 1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885) and 
> when we do our repairs are taking over a day to complete when previously when 
> running with the partition range option they were taking more like 8-9 hours.
> As I understand it, using incremental should have sped this process up as all 
> three sets of data on each repair job should be marked as repaired however 
> this does not seem to be the case. Any ideas?
> Chris
>> On 6 Jun 2017, at 16:08, Anuj Wadehra < 
>> <>> wrote:
>> Hi Chris,
>> Using pr with incremental repairs does not make sense. Primary range repair 
>> is an optimization over full repair. If you run full repair on a n node 
>> cluster with RF=3, you would be repairing each data thrice. 
>> E.g. in a 5 node cluster with RF=3, a range may exist on node A,B and C . 
>> When full repair is run on node A, the entire data in that range gets synced 
>> with replicas on node B and C. Now, when you run full repair on nodes B and 
>> C, you are wasting resources on repairing data which is already repaired. 
>> Primary range repair ensures that when you run repair on a node, it ONLY 
>> repairs the data which is owned by the node. Thus, no node repairs data 
>> which is not owned by it and must be repaired by other node. Redundant work 
>> is eliminated. 
>> Even in pr, each time you run pr on all nodes, you repair 100% of data. Why 
>> to repair complete data in each cycle?? ..even data which has not even 
>> changed since the last repair cycle?
>> This is where Incremental repair comes as an improvement. Once repaired, a 
>> data would be marked repaired so that the next repair cycle could just focus 
>> on repairing the delta. Now, lets go back to the example of 5 node cluster 
>> with RF =3.This time we run incremental repair on all nodes. When you repair 
>> entire data on node A, all 3 replicas are marked as repaired. Even if you 
>> run inc repair on all ranges on the second node, you would not re-repair the 
>> already repaired data. Thus, there is no advantage of repairing only the 
>> data owned by the node (primary range of the node). You can run inc repair 
>> on all the data present on a node and Cassandra would make sure that when 
>> you repair data on other nodes, you only repair unrepaired data.
>> Thanks
>> Anuj

Re: Partition range incremental repairs

2017-06-06 Thread Chris Stokesmore
Thank you for the excellent and clear description of the different versions of 
repair Anuj, that has cleared up what I expect to be happening.

The problem now is that in our cluster we are running repairs with options
(parallelism: parallel, primary range: false, incremental: true, job threads:
1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885), and
when we do, our repairs take over a day to complete, whereas previously, when
running with the partition range option, they took more like 8-9 hours.

As I understand it, using incremental repair should have sped this process up,
as all three sets of data on each repair job should be marked as repaired;
however, this does not seem to be the case. Any ideas?


> On 6 Jun 2017, at 16:08, Anuj Wadehra <> wrote:
> Hi Chris,
> Using pr with incremental repairs does not make sense. Primary range repair
> is an optimization over full repair. If you run full repair on an n-node
> cluster with RF=3, you would be repairing each piece of data three times.
> E.g. in a 5 node cluster with RF=3, a range may exist on nodes A, B and C.
> When full repair is run on node A, the entire data in that range gets synced
> with the replicas on nodes B and C. Now, when you run full repair on nodes B
> and C, you are wasting resources on repairing data which is already repaired.
> Primary range repair ensures that when you run repair on a node, it ONLY
> repairs the data which is owned by the node. Thus, no node repairs data which
> is not owned by it and must be repaired by another node. Redundant work is
> eliminated.
> Even with pr, each time you run pr on all nodes, you repair 100% of the data.
> Why repair the complete data in each cycle, even data which has not
> changed since the last repair cycle?
> This is where incremental repair comes in as an improvement. Once repaired,
> data is marked as repaired so that the next repair cycle can just focus
> on repairing the delta. Now, let's go back to the example of a 5 node cluster
> with RF=3. This time we run incremental repair on all nodes. When you repair
> the entire data on node A, all 3 replicas are marked as repaired. Even if you
> run inc repair on all ranges on the second node, you would not re-repair the
> already repaired data. Thus, there is no advantage to repairing only the data
> owned by the node (the primary range of the node). You can run inc repair on
> all the data present on a node and Cassandra will make sure that when you
> repair data on other nodes, you only repair unrepaired data.
> Thanks
> Anuj
> On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore
> <> wrote:
> Hi all,
> Wondering if anyone had any thoughts on this? At the moment the long running
> repairs cause us to be running them on two nodes at once for a bit of time,
> which obviously increases the cluster load.
> On 2017-05-25 16:18 (+0100), Chris Stokesmore < 
> <>> wrote: 
> > Hi,
> >
> > We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running
> > repairs with the -pr option, via a cron job that runs on each node once per
> > week.
> >
> > We changed that as some advice on the Cassandra IRC channel said it would
> > cause more anticompaction and
> > <> says
> > 'Performing partitioner range repairs by using the -pr option is generally
> > considered a good choice for doing manual repairs. However, this option
> > cannot be used with incremental repairs (default for Cassandra 2.2 and
> > later)'
> >
> > Only problem is our -pr repairs were taking about 8 hours, and now the
> > non-pr repair are taking 24+ - I guess this makes sense, repairing 1/7 of
> > data increased to 3/7, except I was hoping to see a speed up after the
> > first loop through the cluster as each repair will be marking much more
> > data as repaired, right?
> >
> > Is running -pr with incremental repairs really that bad?

Re: Partition range incremental repairs

2017-06-06 Thread Chris Stokesmore
Hi all,

Wondering if anyone had any thoughts on this? At the moment the long running
repairs cause us to be running them on two nodes at once for a bit of time,
which obviously increases the cluster load.

On 2017-05-25 16:18 (+0100), Chris Stokesmore <> wrote: 
> Hi,
> We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running
> repairs with the -pr option, via a cron job that runs on each node once per
> week.
> We changed that as some advice on the Cassandra IRC channel said it would
> cause more anticompaction and
>   says 'Performing partitioner range repairs by using the -pr option is
> generally considered a good choice for doing manual repairs. However, this
> option cannot be used with incremental repairs (default for Cassandra 2.2 and
> later)'
> Only problem is our -pr repairs were taking about 8 hours, and now the non-pr
> repair are taking 24+ - I guess this makes sense, repairing 1/7 of data
> increased to 3/7, except I was hoping to see a speed up after the first loop
> through the cluster as each repair will be marking much more data as
> repaired, right?
> Is running -pr with incremental repairs really that bad?

Partition range incremental repairs

2017-05-25 Thread Chris Stokesmore

We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running
repairs with the -pr option, via a cron job that runs on each node once per
week.

We changed that as some advice on the Cassandra IRC channel said it would cause
more anticompaction and

 says “Performing partitioner range repairs by using the -pr option is
generally considered a good choice for doing manual repairs. However, this
option cannot be used with incremental repairs (default for Cassandra 2.2 and
later)”

Only problem is our -pr repairs were taking about 8 hours, and now the non-pr
repairs are taking 24+ - I guess this makes sense, repairing 1/7 of data
increased to 3/7, except I was hoping to see a speed up after the first loop
through the cluster as each repair will be marking much more data as repaired,
right?

Is running -pr with incremental repairs really that bad?
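
The 1/7 versus 3/7 estimate above can be reproduced with a toy ring. This is a minimal sketch, assuming one token per node, equal-sized ranges and SimpleStrategy-style placement (each range replicated on its owner and the next RF-1 nodes); it models a full repair of whatever ranges are selected and does not model incremental repair skipping already-repaired SSTables. The function names are hypothetical.

```python
from fractions import Fraction


def ranges_repaired_per_node(num_nodes, rf, primary_only):
    """Toy ring: node i owns primary range i, and range i is replicated on
    nodes i, i+1, ..., i+rf-1 (mod num_nodes).
    Returns {node: set of range ids that node repairs}."""
    work = {}
    for node in range(num_nodes):
        if primary_only:
            # -pr: only the range this node is the primary owner of
            work[node] = {node}
        else:
            # no -pr: every range for which this node is one of the rf replicas
            work[node] = {(node - k) % num_nodes for k in range(rf)}
    return work


def summarize(num_nodes=7, rf=3):
    for primary_only in (True, False):
        work = ranges_repaired_per_node(num_nodes, rf, primary_only)
        share = Fraction(len(work[0]), num_nodes)
        total = sum(len(ranges) for ranges in work.values())
        label = "-pr" if primary_only else "all ranges"
        print(f"{label}: each node repairs {share} of the ring; one pass over "
              f"the cluster covers each range {total // num_nodes}x")


summarize()
```

With 7 nodes and RF=3 this gives 1/7 of the ring per node with -pr and 3/7 without it, so a full pass without -pr touches every range three times.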

Re: what is MemtableReclaimMemory mean ??

2017-05-01 Thread Chris Lohfink
Question though, how many tables do you have? If you have more than a few
hundred it could be bottlenecking the flushing if it is flushing very

On Mon, May 1, 2017 at 9:32 PM, Chris Lohfink <> wrote:

> Theres a read barrier to stop reclaiming a memtable when there are
> requests actively reading it. The *MemtableReclaimMemory* pool offloads
> that wait instead of blocking the caller. It in itself is not going to use
> any cpu or increase load. It will however block the releasing of the
> memtable resources which might cause additional heap allocation pressure.
> Its more likely a symptom of GCs or reads being slow than the cause of the
> issue however.
> Chris
> On Mon, May 1, 2017 at 9:01 PM, Pranay akula <>
> wrote:
>> Hi Alain,
>> when  "*MemtableReclaimMemory*"  Pending Tasks increasing, its slowly
>> backing up reads and writes mostly writes. yes i am seeing bit high GC
>> pressure, currently we are using 24Gb Heap  and G1GC collection. I tried
>> changing Memtable flush threshold it did helped a little but not much. I am
>> not seeing any Errors in the Logs.
>> Thanks
>> Pranay.
>> On Thu, Apr 27, 2017 at 6:08 AM, Alain RODRIGUEZ <>
>> wrote:
>>> Hi Pranay,
>>> According to
>>> ssandra/3.0/cassandra/tools/toolsTPstats.html, "*MemtableReclaimMemory*"
>>> is the thread pool used for "Making unused memory available". I don't know
>>> much about it since it was never an issue for me. Neither did I heard much
>>> about it.
>>>- Are pending tasks staying high for a long period? `watch -d
>>>nodetool tpstats`
>>>- What are your GC settings?
>>>- Any other threads pending, blocked or dropped?
>>>- Do you have errors or warnings in your logs?
>>>- Any GC pressure? (monitored through charts or logs at INFO level,
>>>or WARN on recent versions)
>>> C*heers,
>>> ---
>>> Alain Rodriguez - @arodream -
>>> France
>>> The Last Pickle - Apache Cassandra Consulting
>>> 2017-04-16 16:04 GMT+02:00 Pranay akula <>:
>>>> Hi,
>>>> what is *MemtableReclaimMemory* mean in nodetooltpstats ?? does this
>>>> mean trying to flushing memtable from memory to SStables.
>>>> I can see sometimes increase in pending tasks of  MemtableReclaimMemory
>>>> in nodetool tpstats, at that time i can see increase in load on those 
>>>> nodes.
>>>> Does decreasing memtable_cleanup_threshold will help ??
>>>> Thanks
>>>> Pranay.

Re: what is MemtableReclaimMemory mean ??

2017-05-01 Thread Chris Lohfink
There's a read barrier to stop reclaiming a memtable when there are requests
actively reading it. The *MemtableReclaimMemory* pool offloads that wait
instead of blocking the caller. In itself it is not going to use any CPU or
increase load. It will, however, block the releasing of the memtable
resources, which might cause additional heap allocation pressure. It's more
likely a symptom of GCs or reads being slow than the cause of the issue.


On Mon, May 1, 2017 at 9:01 PM, Pranay akula <>

> Hi Alain,
> when "*MemtableReclaimMemory*" pending tasks increase, they slowly
> back up reads and writes, mostly writes. Yes, I am seeing somewhat high GC
> pressure; currently we are using a 24 GB heap and G1GC. I tried
> changing the memtable flush threshold; it helped a little, but not much. I am
> not seeing any errors in the logs.
> Thanks
> Pranay.
> On Thu, Apr 27, 2017 at 6:08 AM, Alain RODRIGUEZ <>
> wrote:
>> Hi Pranay,
>> According to
>> olsTPstats.html, "*MemtableReclaimMemory*" is the thread pool used for
>> "Making unused memory available". I don't know much about it since it was
>> never an issue for me. Nor have I heard much about it.
>> - Are pending tasks staying high for a long period? `watch -d
>>   nodetool tpstats`
>> - What are your GC settings?
>> - Any other threads pending, blocked or dropped?
>> - Do you have errors or warnings in your logs?
>> - Any GC pressure? (monitored through charts or logs at INFO level,
>>   or WARN on recent versions)
>> C*heers,
>> ---
>> Alain Rodriguez - @arodream -
>> France
>> The Last Pickle - Apache Cassandra Consulting
>> 2017-04-16 16:04 GMT+02:00 Pranay akula <>:
>>> Hi,
>>> what does *MemtableReclaimMemory* mean in nodetool tpstats? Does this
>>> mean it is trying to flush memtables from memory to SSTables?
>>> I sometimes see an increase in pending tasks of MemtableReclaimMemory
>>> in nodetool tpstats, and at that time I can see an increase in load on
>>> those nodes.
>>> Will decreasing memtable_cleanup_threshold help?
>>> Thanks
>>> Pranay.

Re: How can I efficiently export the content of my table to KAFKA

2017-04-27 Thread Chris Stromberger

On Wed, Apr 26, 2017 at 2:49 PM, Tobias Eriksson <> wrote:

> Hi
> I would like to make a dump of the database, in JSON format, to KAFKA
> The database contains lots of data, millions and in some cases billions of
> “rows”
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic
> My thinking was to have it scalable such that I will distribute the token
> range of all available partition-keys to a number of (N) processes
> (JSON-Producers)
> First I will have a process which will read through the available tokens
> and then publish them on a KAFKA “Coordinator” Topic
> And then I can create 1, 10, 20 or N processes that will act as Producers
> to the real KAFKA topic, and pick available tokens/partition-keys off of
> the “Coordinator” Topic
> One by one until all the “rows” have been processed.
> So the JSON-Producer will take e.g. a range of 1000 “rows” and convert
> them into my own JSON format and post to KAFKA
> And then after that take another 1000 “rows” and then …. And then another
> 1000 “rows” and so on, until it is done.
> I base my idea on how I believe Apache Spark Connector accomplishes data
> locality, i.e. being aware of where tokens reside and figured that since
> that is possible it should be possible to create a job-list in a KAFKA
> topic, and have each Producer pick jobs from there, and read up data from
> Cassandra based on the partition key (token) and then post the JSON on the
> export KAFKA topic.
> Would you consider this a good idea ?
> Would there in fact be a better idea, what would that be then ?
> -Tobias
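
One way to build the job list described above is to cut the partitioner's token range into N contiguous slices and publish one slice per job. The following is only a sketch under stated assumptions: it assumes the cluster uses Murmur3Partitioner (tokens from -2^63 to 2^63-1); the keyspace, table and partition key names (`my_ks`, `my_table`, `id`) are hypothetical; and the step that publishes each job to the "Coordinator" topic is left out.

```python
import json

# Murmur3Partitioner token bounds (assumption: the cluster uses Murmur3).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_range(num_jobs):
    """Split [MIN_TOKEN, MAX_TOKEN] into num_jobs contiguous, non-overlapping
    slices. Each slice is an inclusive (start, end) pair."""
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN + 1
    base, extra = divmod(total, num_jobs)
    slices = []
    start = MIN_TOKEN
    for i in range(num_jobs):
        # spread the remainder over the first `extra` slices
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        slices.append((start, end))
        start = end + 1
    return slices


def slice_query(keyspace, table, partition_key, start, end):
    """Build the CQL a producer could run for one slice (names are hypothetical)."""
    return (f"SELECT JSON * FROM {keyspace}.{table} "
            f"WHERE token({partition_key}) >= {start} "
            f"AND token({partition_key}) <= {end}")


# Each job could be serialized like this before being put on the job topic.
for start, end in split_token_range(4):
    job = {"start": start, "end": end,
           "cql": slice_query("my_ks", "my_table", "id", start, end)}
    print(json.dumps(job))
```

Each producer would then page through the result of its slice's query and post the rows to the export topic. Slicing evenly by token does not account for data locality or for uneven data distribution, so slice sizes in rows may differ.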

Re: Node always dieing

2017-04-10 Thread Chris Mawata
.SimpleSeedProvider{seeds=,,,,, 10.100.1000.213};

Why do you have all six of your nodes as seeds? Is it possible that the
last one you added used itself as the seed and is isolated?

On Thu, Apr 6, 2017 at 6:48 AM, Cogumelos Maravilha <> wrote:

> Yes C* is running as cassandra:
> cassand+  2267 1 99 10:18 ?00:02:56 java
> -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities
> -XX:Threa...
> INFO  [main] 2017-04-06 10:35:42,956 - Node
> configuration:[allocate_tokens_for_keyspace=null; 
> authenticator=PasswordAuthenticator;
> authorizer=CassandraAuthorizer; auto_bootstrap=true; auto_snapshot=true;
> back_pressure_enabled=false; back_pressure_strategy=org.
>{high_ratio=0.9, factor=5,
> flow=FAST}; batch_size_fail_threshold_in_kb=50;
> batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024;
> broadcast_address=null; broadcast_rpc_address=null; 
> buffer_pool_use_heap_if_exhausted=true;
> cas_contention_timeout_in_ms=600; cdc_enabled=false;
> cdc_free_space_check_interval_ms=250; cdc_raw_directory=null;
> cdc_total_space_in_mb=0; client_encryption_options=;
> cluster_name=company; column_index_cache_size_in_kb=2;
> column_index_size_in_kb=64; commit_failure_policy=ignore;
> commitlog_compression=null; commitlog_directory=/mnt/cassandra/commitlog;
> commitlog_max_compression_buffers_in_pool=3;
> commitlog_periodic_queue_size=-1; commitlog_segment_size_in_mb=32;
> commitlog_sync=periodic; commitlog_sync_batch_window_in_ms=NaN;
> commitlog_sync_period_in_ms=1; commitlog_total_space_in_mb=null;
> compaction_large_partition_warning_threshold_mb=100;
> compaction_throughput_mb_per_sec=16; concurrent_compactors=null;
> concurrent_counter_writes=32; concurrent_materialized_view_writes=32;
> concurrent_reads=32; concurrent_replicates=null; concurrent_writes=32;
> counter_cache_keys_to_save=2147483647; counter_cache_save_period=7200;
> counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=600;
> credentials_cache_max_entries=1000; credentials_update_interval_in_ms=-1;
> credentials_validity_in_ms=2000; cross_node_timeout=false;
> data_file_directories=[Ljava.lang.String;@223f3642;
> disk_access_mode=auto; disk_failure_policy=ignore;
> disk_optimization_estimate_percentile=0.95; 
> disk_optimization_page_cross_chance=0.1;
> disk_optimization_strategy=ssd; dynamic_snitch=true;
> dynamic_snitch_badness_threshold=0.1; 
> dynamic_snitch_reset_interval_in_ms=60;
> dynamic_snitch_update_interval_in_ms=100; 
> enable_scripted_user_defined_functions=false;
> enable_user_defined_functions=false; 
> enable_user_defined_functions_threads=true;
> encryption_options=null; endpoint_snitch=SimpleSnitch;
> file_cache_size_in_mb=null; gc_log_threshold_in_ms=200;
> gc_warn_threshold_in_ms=1000; hinted_handoff_disabled_datacenters=[];
> hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024;
> hints_compression=null; hints_directory=/mnt/cassandra/hints;
> hints_flush_period_in_ms=1; incremental_backups=false;
> index_interval=null; index_summary_capacity_in_mb=null;
> index_summary_resize_interval_in_minutes=60; initial_token=null;
> inter_dc_stream_throughput_outbound_megabits_per_sec=200;
> inter_dc_tcp_nodelay=false; internode_authenticator=null;
> internode_compression=dc; internode_recv_buff_size_in_bytes=0;
> internode_send_buff_size_in_bytes=0; key_cache_keys_to_save=2147483647;
> key_cache_save_period=14400; key_cache_size_in_mb=null;
> listen_address=; listen_interface=null;
> listen_interface_prefer_ipv6=false; listen_on_broadcast_address=false;
> max_hint_window_in_ms=1080; max_hints_delivery_threads=2;
> max_hints_file_size_in_mb=128; max_mutation_size_in_kb=null;
> max_streaming_retries=3; max_value_size_in_mb=256;
> memtable_allocation_type=heap_buffers; memtable_cleanup_threshold=null;
> memtable_flush_writers=0; memtable_heap_space_in_mb=null;
> memtable_offheap_space_in_mb=null; min_free_space_per_drive_in_mb=50;
> native_transport_max_concurrent_connections=-1; native_transport_max_
> concurrent_connections_per_ip=-1; native_transport_max_frame_size_in_mb=256;
> native_transport_max_threads=128; native_transport_port=9042;
> native_transport_port_ssl=null; num_tokens=256; 
> otc_coalescing_strategy=TIMEHORIZON;
> otc_coalescing_window_us=200; 
> partitioner=org.apache.cassandra.dht.Murmur3Partitioner;
> permissions_cache_max_entries=1000; permissions_update_interval_in_ms=-1;
> permissions_validity_in_ms=2000; phi_convict_threshold=8.0;
> prepared_statements_cache_size_mb=null; range_request_timeout_in_ms=600;
> read_request_timeout_in_ms=600; request_scheduler=org.apache.
> cassandra.scheduler.NoScheduler; request_scheduler_id=null;
> request_scheduler_options=null; request_timeout_in_ms=600;
> role_manager=CassandraRoleManager; 

Re: system_auth replication strategy

2017-04-01 Thread Chris Lohfink
You should use NetworkTopologyStrategy with a high RF in each DC, or something
like the everywhere strategy.

You should never really use SimpleStrategy, especially if you have multiple DCs
and are using LOCAL or EACH consistencies. It's more for test and dev setups
than a prod environment.

The problem is that it DOES ensure a LOCAL consistency level will be targeted
in the same DC as the coordinator, but it doesn't ensure there will be data in
each DC. This means that if there are no nodes in your DC that are a replica of
the data, you can get an unavailable exception even if 100% of your nodes are up
and healthy.

So if you have 10 nodes, 5 per dc, 2 dcs and a RF of 3 with simple strategy

[0] [10] [30] [40] [45]

[1] [11] [15] [21] [41]

Especially if you have random token assignment like above. A partition with a 
token of 11 can end up on

[ ] [ ] [ ] [ ] [ ]

[ ] [*] [*] [*] [ ]

In which case an insert or read to a node on DC1 with LOCAL_ONE or LOCAL_QUORUM 
will result in an unavailable exception.
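
As a rough illustration of the example above, here is a minimal Python sketch of how SimpleStrategy walks the ring. The token-to-DC map is taken from the two rows above, assuming the first row is DC1 and the second row is DC2. The function name and the simplified "next RF nodes clockwise, ignoring DC" rule are assumptions for illustration, not Cassandra's actual code.

```python
from bisect import bisect_left

# Token -> datacenter, copied from the two rows in the example above
# (assumption: first row is DC1, second row is DC2).
RING = {
    0: "DC1", 10: "DC1", 30: "DC1", 40: "DC1", 45: "DC1",
    1: "DC2", 11: "DC2", 15: "DC2", 21: "DC2", 41: "DC2",
}


def simple_strategy_replicas(token, ring, rf):
    """Pick the first node whose token is >= the partition token, then the
    next rf-1 nodes clockwise, with no regard for datacenter."""
    tokens = sorted(ring)
    start = bisect_left(tokens, token) % len(tokens)  # wrap around the ring
    return [tokens[(start + i) % len(tokens)] for i in range(rf)]


replicas = simple_strategy_replicas(11, RING, rf=3)
replica_dcs = {RING[t] for t in replicas}

# A LOCAL_ONE / LOCAL_QUORUM request coordinated in DC1 needs a DC1 replica.
dc1_can_serve = "DC1" in replica_dcs
print(replicas, replica_dcs, dc1_can_serve)
```

With these tokens the three replicas all land in DC2, so a LOCAL request coordinated in DC1 has nothing local to talk to.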


> On Apr 1, 2017, at 10:51 AM, Vlad <> wrote:
> Hi,
> what is the suitable replication strategy for system_auth keyspace?
> As I understand factor should be equal to total nodes number, so can we use 
> SimpleStrategy? Does it ensure that queries with LOCAL_ONE consistency level 
> will be targeted to local DC (or the same node)?
> Thanks.

Re: nodes are always out of sync

2017-04-01 Thread Chris Lohfink
Repair has no way to instantly build a perfect view of the data on your 3
nodes at one exact point in time. When a piece of data is written there is a
delay between when it is applied on each of the nodes, even if it's just
500ms. So if a request to read the data and build the merkle tree occurs and
it finishes on node1 at 12:01 while node2 finishes at 12:02, that 1 minute or
so delta (even if it's only a few seconds, or if using snapshot repairs) means
the partition/range hashes in the merkle trees can be different. On a moving
data set it's almost impossible to have the cluster perfectly in sync for a
repair. I wouldn't worry about that log message. If you are worried about
consistency between your reads and writes, use EACH_QUORUM or LOCAL_QUORUM
for both.
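
A toy sketch of the timing effect described above, in Python. It is heavily simplified and the names are made up for illustration: one hash per range stands in for a real merkle tree, and two dicts stand in for two replicas.

```python
import hashlib


def range_hash(rows):
    """Hash all rows of a token range (a stand-in for one merkle tree leaf)."""
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(f"{key}={rows[key]};".encode())
    return h.hexdigest()


# Two replicas holding identical data for the range.
node1 = {"a": 1, "b": 2}
node2 = {"a": 1, "b": 2}

# 12:01, node1 finishes building its hash.
hash_node1 = range_hash(node1)

# A new write arrives and is applied on both replicas.
node1["c"] = 3
node2["c"] = 3

# 12:02, node2 finishes building its hash.
hash_node2 = range_hash(node2)

in_sync_now = node1 == node2
reported_out_of_sync = hash_node1 != hash_node2
print(in_sync_now, reported_out_of_sync)
```

Both replicas end up with the same data, yet the hashes compared by the repair differ because they were taken at different moments.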


On Thu, Mar 30, 2017 at 1:22 AM, Roland Otta <>

> hi,
> we see the following behaviour in our environment:
> cluster consists of 6 nodes (cassandra version 3.0.7). keyspace has a
> replication factor 3.
> clients are writing data to the keyspace with consistency one.
> we are doing parallel, incremental repairs with cassandra reaper.
> even if a repair just finished and we are starting a new one
> immediately, we can see the following entries in our logs:
> INFO  [RepairJobTask:1] 2017-03-30 10:14:00,782 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:00,782 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:00,782 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:03,997 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:03,997 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:03,997 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:05,375 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:05,375 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:05,375 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /
> and / have 1 range(s) out of sync for ad_event_history
> we cant see any hints on the systems ... so we thought everything is
> running smoothly with the writes.
> do we have to be concerned about the nodes always being out of sync or
> is this a normal behaviour in a write intensive table (as the tables
> will never be 100% in sync for the latest inserts)?
> bg,
> roland

Re: partition sizes reported by nodetool tablehistograms

2017-02-24 Thread Chris Lohfink
It's the decompressed size of the partitions. Each sstable has a stats
component that contains histograms for the size and number of columns in
the partitions (among other things; you can see it with the sstablemetadata
tool). tablehistograms merges these across the sstables and gives the results.


On Fri, Feb 24, 2017 at 4:53 PM, John Sanda <> wrote:

> I am working on some issues involving really big partitions. I have been
> making extensive use of nodetool tablehistograms. What exactly is the
> partition size being reported? I have a table for which the max value
> reported is about 3.5 GB, but running du -h against the table data
> directory reports 548 MB. Are the partition sizes reported by
> tablehistograms the decompressed size on disk?
> - John

Re: Backups eating up disk space

2017-01-15 Thread Chris Mawata
You don't have a viable solution because you are not making a snapshot as a
starting point. After a while you will have a lot of backup data. Using
the backups to get your cluster to a given state will involve copying a
very large amount of backup data, possibly more than the capacity of
your cluster, followed by a tremendous amount of compaction. If your
topology changes, life could really get miserable. I would counsel taking
periodic snapshots so that your possible bad day in the future is less bad.
On Jan 13, 2017 8:01 AM, "Kunal Gangakhedkar" 

> Great, thanks a lot to all for the help :)
> I finally took the dive and went with Razi's suggestions.
> In summary, this is what I did:
> - turn off incremental backups on each of the nodes in rolling fashion
> - remove the 'backups' directory from each keyspace on each node.
> This ended up freeing up almost 350GB on each node - yay :)
> Again, thanks a lot for the help, guys.
> Kunal
> On 12 January 2017 at 21:15, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <
>> wrote:
>> snapshots are slightly different than backups.
>> In my explanation of the hardlinks created in the backups folder, notice
>> that compacted sstables, never end up in the backups folder.
>> On the other hand, a snapshot is meant to represent the data at a
>> particular moment in time. Thus, the snapshots directory contains hardlinks
>> to all active sstables at the time the snapshot was taken, which would
>> include: compacted sstables; and any sstables from memtable flush or
>> streamed from other nodes that both exist in the table directory and the
>> backups directory.
>> So, that would be the difference between snapshots and backups.
>> Best regards,
>> -Razi
>> *From: *Alain RODRIGUEZ 
>> *Reply-To: *"" 
>> *Date: *Thursday, January 12, 2017 at 9:16 AM
>> *To: *"" 
>> *Subject: *Re: Backups eating up disk space
>> My 2 cents,
>> As I mentioned earlier, we're not currently using snapshots - it's only
>> the backups that are bothering me right now.
>> I believe backups folder is just the new name for the previously called
>> snapshots folder. But I can be completely wrong, I haven't played that much
>> with snapshots in new versions yet.
>> Anyway, some operations in Apache Cassandra can trigger a snapshot:
>> - Repair (when not using parallel option but sequential repairs instead)
>> - Truncating a table (by default)
>> - Dropping a table (by default)
>> - Maybe other I can't think of... ?
>> If you want to clean space but still keep a backup you can run:
>> "nodetool clearsnapshot"
>> "nodetool snapshot "
>> This way and for a while, data won't be taking space as old files will be
>> cleaned and new files will be only hardlinks as detailed above. Then you
>> might want to work at a proper backup policy, probably implying getting
>> data out of production server (a lot of people uses S3 or similar
>> services). Or just do that from time to time, meaning you only keep a
>> backup and disk space behaviour will be hard to predict.
>> C*heers,
>> ---
>> Alain Rodriguez - @arodream -
>> France
>> The Last Pickle - Apache Cassandra Consulting
>> 2017-01-12 6:42 GMT+01:00 Prasenjit Sarkar :
>> Hi Kunal,
>> Razi's post does give a very lucid description of how cassandra manages
>> the hard links inside the backup directory.
>> Where it needs clarification is the following:
>> --> incremental backups is a system wide setting and so its an all or
>> nothing approach
>> --> as multiple people have stated, incremental backups do not create
>> hard links to compacted sstables. however, this can bloat the size of your
>> backups
>> --> again as stated, it is a general industry practice to place backups
>> in a different secondary storage location than the main production site. So
>> best to move it to the secondary storage before applying rm on the backups
>> folder
>> In my experience with production clusters, managing the backups folder
>> across multiple nodes can be painful if the objective is to ever recover
>> data. With the usual disclaimers, better to rely on third party vendors to
>> accomplish the needful rather than scripts/tablesnap.
>> Regards
>> Prasenjit
>> On Wed, Jan 11, 2017 at 7:49 AM, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <
>>> wrote:
>> Hello Kunal,
>> Caveat: I am not a super-expert on Cassandra, but it helps to explain to
>> others, in order to eventually become an expert, so if my explanation is

Re: Help

2017-01-09 Thread Chris Lohfink
Do you have any monitoring set up around garbage collections? A GC pause +
network latency > write timeout will cause intermittent hints.

On Sun, Jan 8, 2017 at 10:30 PM, Anshu Vajpayee 

> Gossip shows - all nodes are up.
> But when  we perform writes , coordinator stores the hints. It means  -
> coordinator was not able to deliver the writes to few nodes after meeting
> consistency requirements.
> The nodes for which  writes were failing, are in different DC. Those nodes
> do not have any load.
> Gossips shows everything is up.  I already set write timeout to 60 sec,
> but no help.
> Can anyone encounter this scenario ? Network side everything is fine.
> Cassandra version is 2.1.13
> --
> *Regards,*
> *Anshu *

Re: Java GC pauses, reality check

2016-11-25 Thread Chris Lohfink
No tuning will eliminate GCs.

20-30 seconds is horrific and out of the ordinary. That most likely means
antipatterns are in use and/or the JVM is poorly configured. Sub 1s is
realistic, but some workloads may still require tuning to maintain it. Some
workloads are very unfriendly to GCs though (i.e. heavy tombstones, very wide
partitions).


On Fri, Nov 25, 2016 at 3:25 PM, S Ahmed <> wrote:

> Hello!
> From what I understand java GC pauses are pretty much a fact of life, but
> you can tune the jvm to reduce the likelihood of the frequency and length
> of GC pauses.
> When using Cassandra, how frequent or long have these pauses known to be?
> Even with tuning, is it safe to assume they cannot be eliminated?
> Would a 20-30 second pause be something out of the ordinary?
> Thanks.

Re: Can a Select Count(*) Affect Writes in Cassandra?

2016-11-10 Thread Chris Lohfink
count(*) actually pages through all the data. So a select count(*) without
a limit would be expected to cause a lot of load on the system. The hit is
more than just IO load and CPU, it also creates a lot of garbage that can
cause pauses slowing down the entire JVM. Some details here:

You may want to consider maintaining the count yourself, using Spark, or if
you just want a ball park number you can grab it from JMX.
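
To put the paging cost in perspective, here is a back-of-the-envelope Python sketch. The 560,000,000 limit is the one mentioned later in this thread; the page size of 5,000 rows is only an assumed illustrative value, not something taken from this cluster's configuration.

```python
def pages_needed(row_limit, page_size):
    """Number of pages that must be read to walk row_limit rows."""
    return -(-row_limit // page_size)  # integer ceiling division


ROW_LIMIT = 560_000_000  # limit mentioned in the thread
PAGE_SIZE = 5_000        # assumption, for illustration only

pages = pages_needed(ROW_LIMIT, PAGE_SIZE)
print(f"up to {pages:,} pages for a single count(*)")
```

Every one of those pages is read, deserialized and thrown away just to bump a counter, which is where the garbage comes from.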

> Cassandra writes (mutations) are INSERTs, UPDATEs or DELETEs, it actually
> has nothing to do with flushes. A flush is the operation of moving data
> from memory (memtable) to disk (SSTable).

FWIW in 2.0 that's not completely accurate. Before 2.1 the process of
memtable flushing acquired a switchlock that blocks mutations during the
flush (the "pending task" metric is the measure of how many mutations are
blocked by this lock).


On Thu, Nov 10, 2016 at 8:10 AM, Shalom Sagges <>

> Hi Alexander,
> I'm referring to Writes Count generated from JMX:
> [image: Inline image 1]
> The higher curve shows the total write count per second for all nodes in
> the cluster and the lower curve is the average write count per second per
> node.
> The drop in the end is the result of shutting down one application node
> that performed this kind of query (we still haven't removed the query
> itself in this cluster).
> On a different cluster, where we already removed the "select count(*)"
> query completely, we can see that the issue was resolved (also verified
> this with running nodetool cfstats a few times and checked the write count
> difference):
> [image: Inline image 2]
> Naturally I asked how can a select query affect the write count of a node
> but weird as it seems, the issue was resolved once the query was removed
> from the code.
> Another side note.. One of our developers that wrote the query in the
> code, thought it would be nice to limit the query results to 560,000,000.
> Perhaps the ridiculously high limit might have caused this?
> Thanks!
> Shalom Sagges
> T: +972-74-700-4035
> On Thu, Nov 10, 2016 at 3:21 PM, Alexander Dejanovski <
>> wrote:
>> Hi Shalom,
>> Cassandra writes (mutations) are INSERTs, UPDATEs or DELETEs, it actually
>> has nothing to do with flushes. A flush is the operation of moving data
>> from memory (memtable) to disk (SSTable).
>> The Cassandra write path and read path are two different things and, as
>> far as I know, I see no way for a select count(*) to increase your write
>> count (if you are indeed talking about actual Cassandra writes, and not I/O
>> operations).
>> Cheers,
>> On Thu, Nov 10, 2016 at 1:21 PM Shalom Sagges <>
>> wrote:
>>> Yes, I know it's obsolete, but unfortunately this takes time.
>>> We're in the process of upgrading to 2.2.8 and 3.0.9 in our clusters.
>>> Thanks!
>>> Shalom Sagges
>>> DBA
>>> T: +972-74-700-4035 <+972%2074-700-4035>
>>> On Thu, Nov 10, 2016 at 1:31 PM, Vladimir Yudovin <>
>>> wrote:
>>> As I said I'm not sure about it, but it will be interesting to check
>>> memory heap state with any JMX tool, e.g.
>>> -r/jvmtop
>>> By a way, why Cassandra 2.0.14? It's quit old and unsupported version.
>>> Even in 2.0 branch there is 2.0.17 available.
>>> Best regards, Vladimir Yudovin,
>>> *Winguzone <> - Hosted Cloud
>>> CassandraLaunch your cluster in minutes.*
>>>  On Thu, 10 Nov 2016 05:47:37 -0500*Shalom Sagges
>>> < <>>* wrote 

Re: metrics not resetting after running proxyhistograms or cfhistograms

2016-10-25 Thread Chris Lohfink
That behavior went away with 2.2. adds decay to it to
make it recent data which is much better than just resetting on reads.


On Tue, Oct 25, 2016 at 2:06 PM, Andrew Bialecki <> wrote:

> We're running 3.6. Running "nodetool proxyhistograms" twice, we're seeing
> the same data returned each time, but expecting the second run to be reset.
> We're seeing the same behavior with "nodetool cfhistograms."
> I believe resetting after each call used to be the behavior, did that
> change in recent version? We've confirmed metrics reset after the service
> is restarted.
> --
> AB

Re: system_distributed.repair_history table

2016-10-06 Thread Chris Lohfink
Small reminder that unless you have auto_snapshot set to false in
cassandra.yaml, you will need to clear the snapshot (nodetool
clearsnapshot system_distributed) to actually delete the sstables.

On Thu, Oct 6, 2016 at 9:25 AM, Saladi Naidu <> wrote:

> Thanks for the response. It makes sense to periodically truncate as it is
> only for debugging purposes
> Naidu Saladi
> On Wednesday, October 5, 2016 8:03 PM, Chris Lohfink <>
> wrote:
> The only current solution is to truncate it periodically. I opened
> about it if
> interested in following
> On Wed, Oct 5, 2016 at 4:23 PM, Saladi Naidu <>
> wrote:
> We are seeing the following warnings in system.log. As
> *compaction_large_partition_warning_threshold_mb* in cassandra.yaml is at
> its default value of 100, we are seeing these warnings:
> 110:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:05,554
> - Writing large partition
> system_distributed/repair_history:gccatmer:mer_admin_job (115943239 bytes)
> 111:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:13,303
> - Writing large partition
> system_distributed/repair_history:gcconfigsrvcks:user_activation (163926097 bytes)
> When I looked at the table definition, it is partitioned by keyspace and
> columnfamily; under this partition, repair history is maintained. When I
> looked at the count of rows in this partition, most of the partitions have
> >200,000 rows, and these will keep growing because of the partition strategy,
> right? There is no TTL on this, so any idea what the solution is for reducing
> partition size?
> I also looked at size_estimates table for this column family and found that 
> the mean partition size for each range is 50,610,179 which is very large 
> compared to any other tables.

Re: system_distributed.repair_history table

2016-10-05 Thread Chris Lohfink
The only current solution is to truncate it periodically. I opened about it if
interested in following

On Wed, Oct 5, 2016 at 4:23 PM, Saladi Naidu  wrote:

> We are seeing following warnings in system.log,  As
> *compaction_large_partition_warning_threshold_mb*  in cassandra.yaml file
> is as default value 100, we are seeing these warnings
> 110:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:05,554
> - Writing large partition 
> system_distributed/repair_history:gccatmer:mer_admin_job
> (115943239 bytes)
> 111:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:13,303 
> - Writing large partition 
> system_distributed/repair_history:gcconfigsrvcks:user_activation (163926097 
> bytes)
> When I looked at the table definition, it is partitioned by keyspace and
> columnfamily; under this partition, repair history is maintained. When I
> looked at the count of rows in this partition, most of the partitions have
> >200,000 rows, and these will keep growing because of the partition strategy,
> right? There is no TTL on this, so any idea what the solution is for reducing
> partition size?
> I also looked at size_estimates table for this column family and found that 
> the mean partition size for each range is 50,610,179 which is very large 
> compared to any other tables.

Re: repair_history maintenance

2016-09-23 Thread Chris Lohfink
Probably should just periodically truncate/clear snapshots when it gets too
big (will probably take months before noticeable). I opened for discussion on if
it should use TTLs


On Thu, Sep 22, 2016 at 1:28 PM, <>

> Should there be a maintenance schedule for repair_history? Meaning, a
> scheduled nodetool repair and/or deletion schedule? Or is it the intention
> that this table just grow for the life of the cluster?

Finding records that exist on Cassandra but not externally

2016-09-07 Thread chris
First off, I hope this is appropriate here. I couldn't decide whether this was a
question for Cassandra users or Spark users, so if you think it's in the wrong
place feel free to redirect me.

I have a system that does a load of data manipulation using Spark. The output
of this program is effectively the new state that I want my Cassandra table
to be in and the final step is to update Cassandra so that it matches this 

At present I'm currently inserting all rows in my generated state into 
Cassandra. This works for new rows and also for updating existing rows but 
doesn't of course delete any rows that were already in Cassandra but not in my 
new state. 
The problem I have now is how best to delete these missing rows. Options I have 
considered are:

1. Setting a ttl on inserts which is roughly the same as my data refresh 
period. This would probably be pretty performant but I really don't want to do 
this because it would mean that all data in my database would disappear if I 
had issues running my refresh task!

2. Every time I refresh the data I would first have to fetch all primary keys
from Cassandra and compare them to the primary keys locally to create a list of
pks to delete before the insert. This seems the most logically correct option
but is going to result in reading vast amounts of data from Cassandra.

3. Truncating the entire table before refreshing Cassandra. This has the 
benefit of being pretty simple in code but I'm not sure of the performance 
implications of this and what will happen if I truncate while a node is offline.

For reference the table is on the order of 10s of millions of rows and for any 
data refresh only a very small fraction (<.1%) will actually need deleting. 99% 
of the time I'll just be overwriting existing keys. 

I'd be grateful if anyone could offer some advice on the best solution here or
whether there's some better way I haven't thought of.
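
For reference, option 2 above boils down to a set difference on primary keys. A minimal Python sketch follows; plain sets and dicts stand in for the key scan from Cassandra and for the Spark output, and all names and sample keys are hypothetical.

```python
def keys_to_delete(existing_keys, new_state):
    """Primary keys present in Cassandra but absent from the new state."""
    return set(existing_keys) - set(new_state)


# Hypothetical primary keys currently stored in the table.
existing_keys = {("user", 1), ("user", 2), ("user", 3)}

# Hypothetical freshly generated state, keyed by primary key.
new_state = {("user", 1): "a", ("user", 3): "c", ("user", 4): "d"}

stale = keys_to_delete(existing_keys, new_state)
print(stale)  # only these keys need an explicit DELETE
```

Only the primary key columns have to be read for this, not the full rows, and the upsert of the new state can stay exactly as it is today.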



Re: How to get information of each read/write request?

2016-08-30 Thread Chris Lohfink
Running a query with trace (`TRACING ON` in cqlsh) can give you a lot of
the information for an individual request. There has been a ticket to track
time in queue ( but no
one's worked on it yet.


On Tue, Aug 30, 2016 at 12:20 PM, Jun Wu <> wrote:

> Hi there,
>  I'm very interested in the read/write path of Cassandra.
> Specifically, I'd like to know the whole process when a read/write request
> comes in.
> I noticed that for each request it could go through multiple stages.
> For example, for read request, it could be in ReadStage,
> RequestResponseStage, ReadRepairStage. For each stage, actually it's a
> queue and thread pool to serve the request.
> First question is how to track each request in which stage.
> Also I'm very interested in the waiting time for each request to be in
> the queue, also the total queue in each stage. I noticed that in nodetool
> tpstats will have this information. However, I may want to get the
> real-time information of this, like print it out in the terminal.
> I'm wondering  whether someone has hints on this.
>Thanks in advance!
> Jun

Re: Support/Consulting companies

2016-08-19 Thread Chris Tozer
Instaclustr ( ) also offers Cassandra consulting

On Friday, August 19, 2016, <> wrote:

> Yes, TLP is the place to go!
> Joe
> Sent from my iPhone
> On Aug 19, 2016, at 12:03 PM, Huang, Roger wrote:
> *From:* Roxy Ubi
> *Sent:* Friday, August 19, 2016 2:02 PM
> *Subject:* Support/Consulting companies
> Howdy,
> I'm looking for a list of support or consulting companies that provide
> contracting services related to Cassandra.  Is there a comprehensive list
> somewhere?  Alternatively could you folks tell me who you use?
> Thanks in advance for any replies!
> Roxy

Chris Tozer


(408) 781-7914

Spin Up a Free 14 Day Trial <>

Re: Hintedhandoff mutation

2016-08-17 Thread Chris Lohfink
Probably a question better suited for the dev@ list. But afaik the answer
is that there is no way to tell the difference, though it's probably safe to
look at the created time; HHs tend to be older.


On Wed, Aug 17, 2016 at 5:02 AM, Stone Fang <> wrote:

> Hi All,
> I want to differ hintedhandoff mutation and normal write mutation when i
> receive a mutation.
> how to get this in cassandra source code.have not found any attribute
> about this in Mutation class.
> or there is no way to get this.
> thanks
> stone

Re: a solution of getting cassandra cross-datacenter latency at a certain time

2016-08-08 Thread Chris Lohfink
If you invoke the values operation on the mbean every minute (or whatever
period) you can get a histogram of the cross-DC latencies. Just keep
track of the value of each bin in the histogram and look at the delta from
the previous time to the current time to find how many latencies occurred in
each bin's range during the period.
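
The delta bookkeeping described above can be sketched like this in Python. The bucket counts are made-up sample values, and fetching them from the mbean is left out; the function name is hypothetical.

```python
def bucket_deltas(previous, current):
    """Per-bucket counts recorded between two polls of a cumulative histogram."""
    if len(previous) != len(current):
        raise ValueError("histograms must have the same number of buckets")
    return [cur - prev for prev, cur in zip(previous, current)]


# Made-up cumulative bucket counts from two consecutive polls.
poll_at_t0 = [0, 4, 10, 2]
poll_at_t1 = [0, 9, 12, 2]

recent = bucket_deltas(poll_at_t0, poll_at_t1)
print(recent)       # latencies that landed in each bucket during the period
print(sum(recent))  # total cross-DC messages measured during the period
```

Keep the last poll around between runs, and treat a negative delta as a sign that the node restarted and its counters started over.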

Also can wait for CASSANDRA-11752 for a "recent"
histogram (although would need to apply it to this histogram as well).

Chris Lohfink

On Mon, Aug 8, 2016 at 8:50 AM, Ryan Svihla <> wrote:

> The first issue I can think of is the Latency table, if I understand you
> correctly, has an unbounded size for the partition key of DC and will over
> time just get larger as more measurements are recorded.
> Regards,
> Ryan Svihla
> On Aug 8, 2016, at 2:58 AM, Stone Fang <> wrote:
> *objective*:get cassandra cross-datacenter latency in time
> *existing ticket:*
> there is a ticket [track cross-datacenter latency](https://issues.
> but it is a statistics value from node starting,i want to get the
> instantaneous value in a certain time.
> *thought*
> want to write a message into **MESSAGE TABLE** in 1s timer task(the period
> is similar to most of cross datacenter latency )
> ,and replicate to other datacenter,there will be a delay.and I capture
> it,and write to **LATENCY TABLE**.i can query the latency value from this
> table with the condition of certain time.
> *schema*
> message table for replicating data cross datacenter
> create keyspace heartbeat with replication=
> {'class':'NetworkTopologyStrategy','dc1':1, 'dc2':1...};
> }
> latency Table for querying latency value
> problems
> 1.can this solution work to get the cross-datacenter latency?
> 2.create heartbeat keyspace in cassandra bootstrap process,i need to load
> Heartbeat keyspace in save this keyspace into SystemSchema.
> also need to check if this keyspace has exist after first node i
> think this is not a good solution.
> 3.compared to 1,try another solution.generate heartbeat message in a
> standalone jar.but always i need to capture heartbeat message mutation in
> i need to check if the mutation is about heartbeat message.and
> it seems strange that check the heartbeat keyspace which is not defined in
> cassandra,but third-party.
> hope to see your thought on this.
> thanks
> stone

Re: Approximate row count

2016-07-27 Thread Chris Lohfink
The number of keys is the number of *partition keys*, not row keys. You
have ~39434 partitions, ranging from 311 bytes to 386 MB. Looks like you
have some wide partitions that contain many of your rows.
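
A quick sanity check on reading cfstats that way, as a Python sketch using only numbers from the output quoted below. It assumes the compacted partition mean is an uncompressed size and that the compression ratio applies uniformly, so treat it as a rough estimate only.

```python
# Values copied from the cfstats output quoted below.
partitions = 39434                      # Number of keys (estimate)
mean_partition_bytes = 6424107          # Compacted partition mean bytes
compression_ratio = 0.20347134631246266 # SSTable Compression Ratio
space_used_live = 51255709817           # Space used (live)

# Assumption: partition sizes are reported uncompressed.
uncompressed_estimate = partitions * mean_partition_bytes
on_disk_estimate = uncompressed_estimate * compression_ratio

relative_error = abs(on_disk_estimate - space_used_live) / space_used_live
print(f"estimated on disk: {on_disk_estimate / 1e9:.1f} GB")
print(f"reported live:     {space_used_live / 1e9:.1f} GB")
print(f"relative error:    {relative_error:.2%}")
```

The estimate lands close to the reported live space, which fits ~39k wide partitions rather than ~39k rows.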

Chris Lohfink

On Wed, Jul 27, 2016 at 1:44 PM, Luke Jolly <> wrote:

> I have a table that I'm storing ad impression data in with every row being
> an impression.  I want to get a count of total rows / impressions.  I know
> that there is in the ball park of 200-400 million rows in this table and
> from my reading "Number of keys" in the output of cfstats should be a
> reasonably accurate estimate. However, it is 39434. Am I misunderstanding
> something? Every node in my cluster has a complete copy of the keyspace.
>   Table: impressions_2
>   SSTable count: 22
>   Space used (live): 51255709817
>   Space used (total): 51255709817
>   Space used by snapshots (total): 49415721741
>   Off heap memory used (total): 30824975
>   SSTable Compression Ratio: 0.20347134631246266
>   Number of keys (estimate): 39434
>   Memtable cell count: 18279
>   Memtable data size: 15897457
>   Memtable off heap memory used: 0
>   Memtable switch count: 1294
>   Local read count: 347016
>   Local read latency: 12.573 ms
>   Local write count: 109226238
>   Local write latency: 0.023 ms
>   Pending flushes: 0
>   Bloom filter false positives: 655
>   Bloom filter false ratio: 0.0
>   Bloom filter space used: 97552
>   Bloom filter off heap memory used: 97376
>   Index summary off heap memory used: 26719
>   Compression metadata off heap memory used: 30700880
>   Compacted partition minimum bytes: 311
>   Compacted partition maximum bytes: 386857368
>   Compacted partition mean bytes: 6424107
>   Average live cells per slice (last five minutes): 
> 1027.9502011434631
>   Maximum live cells per slice (last five minutes): 5722
>   Average tombstones per slice (last five minutes): 1.0
>   Maximum tombstones per slice (last five minutes): 1

RE: My cluster shows high system load without any apparent reason

2016-07-22 Thread Chris Lee
Unsubscribe me.

Thank you

From: Ryan Svihla []
Sent: Friday, July 22, 2016 14:39
Subject: Re: My cluster shows high system load without any apparent reason

You aren't using counters by chance?


Ryan Svihla

On Jul 22, 2016, 2:00 PM -0500, Mark Rose wrote:

Hi Garo,

Are you using XFS or Ext4 for data? XFS is much better at deleting
large files, such as may happen after a compaction. If you have 26 TB
in just two tables, I bet you have some massive sstables which may
take a while for Ext4 to delete, which may be causing the stalls. The
underlying block layers will not show high IO-wait. See if the stall
times line up with large compactions in system.log.

If you must use Ext4, another way to avoid issues with massive
sstables is to run more, smaller instances.

As an aside, for the amount of reads/writes you're doing, I've found
using c3/m3 instances with the commit log on the ephemeral storage and
data on st1 EBS volumes to be much more cost effective. It's something
to look into if you haven't already.


On Fri, Jul 22, 2016 at 8:10 AM, Juho Mäkinen wrote:

After a few days I've also tried disabling Linux kernel huge pages
defragmentation (echo never > /sys/kernel/mm/transparent_hugepage/defrag) and
turning coalescing off (otc_coalescing_strategy: DISABLED), but neither did
any good. I'm using LCS, there are no big GC pauses, and I have set
"concurrent_compactors: 5" (machines have 16 CPUs), but there are usually
not any compactions running when the load spike comes. "nodetool tpstats"
shows no running thread pools except on the Native-Transport-Requests
(usually 0-4) and perhaps ReadStage (usually 0-1).

The symptoms are the same: after about 12-24 hours an increasing number of
nodes start to show short CPU load spikes and this affects the median read
latencies. I ran a dstat when a load spike was already under way (see
screenshot, but any other column than the
load itself doesn't show any major change except the system/kernel CPU

All further ideas how to debug this are greatly appreciated.

On Wed, Jul 20, 2016 at 7:13 PM, Juho Mäkinen 

Re: sstabledump failing for system keyspace tables

2016-06-11 Thread Chris Lohfink
related to, most of
the system tables will work but batches are kinda special cased and uses
the localpartitioner (see:
 ) like secondary indexes but isn't caught by the 2i check to use the local

If you want you can open a jira for this or I can later. A workaround in
meantime while waiting for a fix may be to actually use a relative path
with a ".." or "." in it to take advantage of the issue mentioned in this


On Sat, Jun 11, 2016 at 3:00 PM, Bhuvan Rawal <> wrote:

> I have been trying to obtain json dump of batches table using sstabledump
> but I get this exception:
> $ sstabledump
> /sstable/data/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-277-big-Data.db
> Exception in thread "main"
> org.apache.cassandra.exceptions.ConfigurationException: Cannot use abstract
> class 'org.apache.cassandra.dht.LocalPartitioner' as partitioner.
> at org.apache.cassandra.utils.FBUtilities.construct(
> at
> org.apache.cassandra.utils.FBUtilities.instanceOrConstruct(
> at
> org.apache.cassandra.utils.FBUtilities.newPartitioner(
> at
> at
> I further tried Andrew Tolbert's sstable tool but it gives the same
> exception.
> $ java -jar sstable-tools-3.0.0-alpha4.jar describe
> /sstable/data/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-277-big-Data.db
> /sstable/data/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-277-big-Data.db
> org.apache.cassandra.exceptions.ConfigurationException: Cannot use
> abstract class 'org.apache.cassandra.dht.LocalPartitioner' as partitioner.
> at org.apache.cassandra.utils.FBUtilities.construct(
> Any way by which I can figure out the content of batches table?
> Thanks & Regards,
> Bhuvan

Re: Latency overhead on Cassandra cluster deployed on multiple AZs (AWS)

2016-04-11 Thread Chris Lohfink
Where do you get the ~1ms latency between AZs? Comparing a short term
average to a 99th percentile isn't very fair.

"Over the last month, the median is 2.09 ms, 90th percentile is 20ms,
99th percentile
is 47ms." - per

Are you using EBS? That would further impact latency on reads and GCs will
always cause hiccups in the 99th+.
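
To illustrate why an average and a tail percentile are not comparable, here is a
small sketch in Python. The sample latencies are made up for illustration (they
are not from AWS or from the test quoted below), and nearest_rank_percentile is
a hypothetical helper, not part of any Cassandra tooling.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: mostly fast, with a few slow outliers.
latencies_ms = [1.0] * 95 + [20.0] * 4 + [50.0]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms:.2f}ms median={median_ms:.2f}ms p99={p99_ms:.2f}ms")
```

With a skewed distribution like this one, the mean and median stay low while the
99th percentile is dominated by the few slow requests, so a ~1ms average says
little about what the tail should look like.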


On Mon, Apr 11, 2016 at 7:57 AM, Alessandro Pieri <> wrote:

> Hi everyone,
> Last week I ran some tests to estimate the latency overhead introduced in
> a Cassandra cluster by a multi availability zones setup on AWS EC2.
> I started a Cassandra cluster of 6 nodes deployed on 3 different AZs (2
> nodes/AZ).
> Then, I used cassandra-stress to create an INSERT (write) test of 20M
> entries with a replication factor = 3, right after, I ran cassandra-stress
> again to READ 10M entries.
> Well, I got the following unexpected result:
> Single-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.06ms/7.41ms/55.81ms
> Multi-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.16ms/38.14ms/47.75ms
> Basically, switching to the multi-AZ setup increased the latency by ~30ms.
> That's too much considering that the average network latency between AZs on
> AWS is ~1ms.
> Since I couldn't find anything to explain those results, I decided to run
> the cassandra-stress specifying only a single node entry (i.e. "--nodes
> node1" instead of "--nodes node1,node2,node3,node4,node5,node6") and
> surprisingly the latency went back to 5.9 ms.
> Trying to recap:
> Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" -> 95th
> percentile: 38.14ms
> Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms
> For the sake of completeness I've run a further test using a consistency
> level = LOCAL_QUORUM and the test did not show any large variance between
> using a single node and multiple ones.
> Do you guys know what could be the reason?
> The tests were executed on an m3.xlarge (network optimized) using the
> DataStax AMI 2.6.3 running Cassandra v2.0.15.
> Thank you in advance for your help.
> Cheers,
> Alessandro

Cassandra nodes using internal network to try and talk externally

2016-04-07 Thread Chris Elsmore

I have a Cassandra 2.2.5 cluster with a datacenter DC03 with 5 nodes in a ring 
and I have DC04 with one node. 

Setup by default with all nodes talking on the external interfaces works well, 
no problems, all nodes in each DC can see and talk to each other.

I’m trying to follow the instructions here
 for the node in DC04 in preparation for adding a new node.

When I follow the instructions to set the listen_address to the internal 
address, broadcast_address to the external address and to set 
listen_on_broadcast_address to true, the nodes in DC03 can connect but do not handshake 
with the node in DC04. The output of ‘lsof -i -P | grep 7000’ shows that the 
node in DC04 is trying to connect to the IPs of the nodes in DC04 over the 
internal network, which obviously doesn’t work.

Any clues? I’m at a loss!


Re: DataModelling to query date range

2016-03-24 Thread Chris Martin
Ah, that looks interesting!  I'm actually still on Cassandra 2.x but I was
planning on upgrading anyway.  Once I do so I'll check this one out.


On Thu, Mar 24, 2016 at 2:57 AM, Henry M <> wrote:

> I haven't tried the new SASI indexer but it may help:
> On Wed, Mar 23, 2016 at 2:08 PM, Chris Martin <>
> wrote:
>> Hi all,
>> I have a table that represents a train timetable and looks a bit like
>> this:
>> CREATE TABLE routes (
>> start text,
>> end text,
>> validFrom timestamp,
>> validTo timestamp,
>> PRIMARY KEY (start, end, validFrom, validTo)
>> );
>> In this case validFrom is the date that the route becomes valid and
>> validTo is the date that the route stops being valid.
>> If this was SQL I could write a query to find all valid routes between
>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>> like:
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>> As far as I can tell such a query is impossible with CQL and my current
>> table structure.  I'm considering running a query like:
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31
>> And then filtering the rest of the data app side.  This doesn't seem
>> ideal though as I'm going to end up fetching much more data (probably
>> around an order of magnitude more) from Cassandra than I really want.
>> Is there a better way to model the data?
>> thanks,
>> Chris

Re: DataModelling to query date range

2016-03-24 Thread Chris Martin
Hi Vidur,

I had a go at your solution, but the problem is that it doesn't match routes
which are valid throughout the whole range queried.  For example, if I have
a route that is valid for all of Jan 2016, I will have a table that looks
something like this:

start    | end        | valid
New York | Washington | 2016-01-01
New York | Washington | 2016-01-31

So if I query for ranges that have at least one bound outside Jan (e.g. Jan
15 - Feb 15) then the query you gave will work fine.  If, however, I query
for a range that is completely inside Jan, e.g. all routes valid on Jan 15th,
then I think I'll end up with a query like:

SELECT * from routes where start = 'New York' and end = 'Washington'
and valid <= 2016-01-15 and valid >= 2016-01-15.

which will return 0 results as it would only match routes that have a valid
value of exactly 2016-01-15.
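
The mismatch can be sketched in Python. This is only an illustration with
in-memory data, not a Cassandra query: the single route valid for all of
January is the assumed sample data, and the function names are hypothetical.
The endpoint-row model only sees the two stored dates, whereas an
interval-overlap test compares the query range with the route's full validity
range.

```python
from datetime import date

# One route valid for all of January 2016 (assumed sample data).
valid_from, valid_to = date(2016, 1, 1), date(2016, 1, 31)

# Endpoint-row model: one row per boundary date, as in the table above.
endpoint_rows = [valid_from, valid_to]


def endpoint_query(rows, q_start, q_end):
    """Mimic: valid >= q_start AND valid <= q_end on the endpoint rows."""
    return [d for d in rows if q_start <= d <= q_end]


def overlaps(v_from, v_to, q_start, q_end):
    """Interval overlap: validFrom <= q_end AND validTo >= q_start."""
    return v_from <= q_end and v_to >= q_start


jan_15 = date(2016, 1, 15)
feb_15 = date(2016, 2, 15)

# Range with a bound outside January: the endpoint model finds the route.
print(endpoint_query(endpoint_rows, jan_15, feb_15))
# Range entirely inside January: the endpoint model finds nothing...
print(endpoint_query(endpoint_rows, jan_15, jan_15))
# ...although the route is valid on that day.
print(overlaps(valid_from, valid_to, jan_15, jan_15))
```

The overlap predicate needs both validity bounds of a route at once, which is
exactly what the single-column model throws away.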



On Wed, Mar 23, 2016 at 11:19 PM, Vidur Malik <> wrote:

> Flip the problem over. Instead of storing validTo and validFrom, simply
> store a valid field and partition by (start, end). This may sound wasteful,
> but disk is cheap:
> CREATE TABLE routes (
> start text,
> end text,
> valid timestamp,
> PRIMARY KEY ((start, end), valid)
> );
> Now, you can execute something like:
> SELECT * from routes where start = 'New York' and end = 'Washington' and 
> valid <= 2016-01-31 and valid >= 2016-01-01.
> On Wed, Mar 23, 2016 at 5:08 PM, Chris Martin <>
> wrote:
>> Hi all,
>> I have a table that represents a train timetable and looks a bit like
>> this:
>> CREATE TABLE routes (
>> start text,
>> end text,
>> validFrom timestamp,
>> validTo timestamp,
>> PRIMARY KEY (start, end, validFrom, validTo)
>> );
>> In this case validFrom is the date that the route becomes valid and
>> validTo is the date that the route stops being valid.
>> If this was SQL I could write a query to find all valid routes between
>> New York and Washington from Jan 1st 2016 to Jan 31st 2016 using something
>> like:
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31 and validTo >= 2016-01-01.
>> As far as I can tell such a query is impossible with CQL and my current
>> table structure.  I'm considering running a query like:
>> SELECT * from routes where start = 'New York' and end = 'Washington' and 
>> validFrom <= 2016-01-31
>> And then filtering the rest of the data app side.  This doesn't seem
>> ideal though as I'm going to end up fetching much more data (probably
>> around an order of magnitude more) from Cassandra than I really want.
>> Is there a better way to model the data?
>> thanks,
>> Chris
> --
> Vidur Malik
> 800.820.9814
