Re: Can a Select Count(*) Affect Writes in Cassandra?
Shalom, you may have a high trace probability, which could explain what you're observing: https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsSetTraceProbability.html

On Thu, Nov 10, 2016 at 3:37 PM Chris Lohfink <clohfin...@gmail.com> wrote:
> count(*) actually pages through all the data, so a select count(*) without a limit would be expected to cause a lot of load on the system. The hit is more than just IO load and CPU; it also creates a lot of garbage that can cause pauses slowing down the entire JVM. Some details here: http://www.datastax.com/dev/blog/counting-keys-in-cassandra
>
> You may want to consider maintaining the count yourself, using Spark, or, if you just want a ballpark number, grabbing it from JMX.
>
> > Cassandra writes (mutations) are INSERTs, UPDATEs or DELETEs; it actually has nothing to do with flushes. A flush is the operation of moving data from memory (memtable) to disk (SSTable).
>
> FWIW, in 2.0 that's not completely accurate. Before 2.1, the process of memtable flushing acquired a switchlock that blocks mutations during the flush (the "pending task" metric is the measure of how many mutations are blocked by this lock).
>
> Chris
>
> On Thu, Nov 10, 2016 at 8:10 AM, Shalom Sagges <shal...@liveperson.com> wrote:
> > Hi Alexander,
> >
> > I'm referring to Writes Count generated from JMX:
> > [image: Inline image 1]
> >
> > The higher curve shows the total write count per second for all nodes in the cluster, and the lower curve is the average write count per second per node.
> > The drop at the end is the result of shutting down one application node that performed this kind of query (we still haven't removed the query itself in this cluster).
> > On a different cluster, where we already removed the "select count(*)" query completely, we can see that the issue was resolved (also verified this by running nodetool cfstats a few times and checking the write count difference):
> > [image: Inline image 2]
> >
> > Naturally I asked how a select query can affect the write count of a node, but weird as it seems, the issue was resolved once the query was removed from the code.
> >
> > Another side note: one of our developers who wrote the query thought it would be nice to limit the query results to 560,000,000. Perhaps the ridiculously high limit might have caused this?
> >
> > Thanks!
> >
> > Shalom Sagges
> > DBA
> > T: +972-74-700-4035
> >
> > On Thu, Nov 10, 2016 at 3:21 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
> > > Hi Shalom,
> > >
> > > Cassandra writes (mutations) are INSERTs, UPDATEs or DELETEs; it actually has nothing to do with flushes. A flush is the operation of moving data from memory (memtable) to disk (SSTable).
> > >
> > > The Cassandra write path and read path are two different things and, as far as I know, I see no way for a select count(*) to increase your write count (if you are indeed talking about actual Cassandra writes, and not I/O operations).
> > >
> > > Cheers,
> > >
> > > On Thu, Nov 10, 2016 at 1:21 PM Shalom Sagges <shal...@liveperson.com> wrote:
> > > > Yes, I know it's obsolete, but unfortunately this takes time.
> > > > We're in the process of upgrading to 2.2.8 and 3.0.9 in our clusters.
> > > >
> > > > Thanks!
> > > > On Thu, Nov 10, 2016 at 1:31 PM, Vladimir Yudovin <vla...@winguzone.com> wrote:
> > > > > As I said, I'm not sure about it, but it would be interesting to check the memory heap state with any JMX tool, e.g. https://github.com/patric-r/jvmtop
> > > > >
> > > > > By the way, why Cassandra 2.0.14? It's quite an old and unsupported version. Even in the 2.0 branch there is 2.0.17 available.
> > > > >
> > > > > Best regards, Vladimir Yudovin
> > > > >
> > > > > On Thu, 10 Nov 2016 05:47:37 -0500, Shalom Sagges <shal...@liveperson.com> wrote
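Chris's point that count(*) pages through all the data, rather than reading a stored total, can be sketched with a toy paging loop. This is an illustration only (`paged_count` and its page size are hypothetical, not the driver API), but it shows why the cost and the garbage generated scale with every row in the table:

```python
# Toy model (hypothetical, not the Cassandra driver API): a coordinator
# answering count(*) has no stored total, so it fetches page after page
# and materializes every row just to count it.

def paged_count(rows, page_size=5000):
    """Simulate driver-side paging: fetch pages until exhausted."""
    total = 0
    pages = 0
    it = iter(rows)
    while True:
        page = [r for _, r in zip(range(page_size), it)]
        if not page:
            break
        pages += 1
        total += len(page)  # each row exists only to bump the counter
    return total, pages

print(paged_count(range(12_000), page_size=5000))  # (12000, 3)
```

At the 560,000,000-row limit mentioned later in the thread, that is on the order of a hundred thousand pages flowing through the coordinator just to produce a single number, which matches Chris's note about garbage and JVM pauses.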
Re: Can a Select Count(*) Affect Writes in Cassandra?
Could you check the write count on a per-table basis, in order to see which specific table is actually receiving writes? Check the OneMinuteRate attribute on org.apache.cassandra.metrics:type=ColumnFamily,keyspace=keyspace1,scope=standard1,name=WriteLatency (make sure you replace the keyspace and table names here).

Also, check whether you have tracing turned on, as it can indeed generate writes to the sessions and events tables for every query you send: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tracing_r.html

Cheers,

On Thu, Nov 10, 2016 at 3:11 PM Shalom Sagges <shal...@liveperson.com> wrote:
> Hi Alexander,
>
> I'm referring to Writes Count generated from JMX:
> [image: Inline image 1]
>
> The higher curve shows the total write count per second for all nodes in the cluster, and the lower curve is the average write count per second per node.
> The drop at the end is the result of shutting down one application node that performed this kind of query (we still haven't removed the query itself in this cluster).
>
> On a different cluster, where we already removed the "select count(*)" query completely, we can see that the issue was resolved (also verified this by running nodetool cfstats a few times and checking the write count difference):
> [image: Inline image 2]
>
> Naturally I asked how a select query can affect the write count of a node, but weird as it seems, the issue was resolved once the query was removed from the code.
>
> Another side note: one of our developers who wrote the query thought it would be nice to limit the query results to 560,000,000. Perhaps the ridiculously high limit might have caused this?
>
> Thanks!
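The OneMinuteRate attribute suggested above is, assuming the Dropwizard/Metrics implementation Cassandra's metrics are built on, an exponentially weighted moving average updated on a 5-second tick rather than a raw counter. A toy version of that math (the class and constants below are a sketch, not Cassandra code):

```python
import math

# Sketch of a Dropwizard-style one-minute EWMA meter (assumption: 5-second
# ticks and a one-minute decay window, as in the Metrics library).

TICK_SECONDS = 5
ALPHA = 1 - math.exp(-TICK_SECONDS / 60.0)  # one-minute decay factor

class OneMinuteMeter:
    def __init__(self):
        self.rate = None   # events per second
        self.uncounted = 0

    def mark(self, n=1):
        self.uncounted += n

    def tick(self):        # called every TICK_SECONDS
        instant = self.uncounted / TICK_SECONDS
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant  # first tick seeds the average
        else:
            self.rate += ALPHA * (instant - self.rate)

meter = OneMinuteMeter()
for _ in range(24):        # two minutes of a steady 1000 writes per tick
    meter.mark(1000)
    meter.tick()
print(round(meter.rate))   # a steady load reads as 200 writes/second
```

Reading this attribute table by table, as suggested above, tells you which table the unexplained writes are actually landing on.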
Re: Can a Select Count(*) Affect Writes in Cassandra?
Hi Shalom,

Cassandra writes (mutations) are INSERTs, UPDATEs or DELETEs; it actually has nothing to do with flushes. A flush is the operation of moving data from memory (memtable) to disk (SSTable).

The Cassandra write path and read path are two different things and, as far as I know, I see no way for a select count(*) to increase your write count (if you are indeed talking about actual Cassandra writes, and not I/O operations).

Cheers,

On Thu, Nov 10, 2016 at 1:21 PM Shalom Sagges <shal...@liveperson.com> wrote:
> Yes, I know it's obsolete, but unfortunately this takes time.
> We're in the process of upgrading to 2.2.8 and 3.0.9 in our clusters.
>
> Thanks!
>
> On Thu, Nov 10, 2016 at 1:31 PM, Vladimir Yudovin <vla...@winguzone.com> wrote:
> > As I said, I'm not sure about it, but it would be interesting to check the memory heap state with any JMX tool, e.g. https://github.com/patric-r/jvmtop
> >
> > By the way, why Cassandra 2.0.14? It's quite an old and unsupported version. Even in the 2.0 branch there is 2.0.17 available.
> >
> > Best regards, Vladimir Yudovin
> >
> > On Thu, 10 Nov 2016 05:47:37 -0500, Shalom Sagges <shal...@liveperson.com> wrote:
> > > Thanks for the quick reply, Vladimir.
> > > Is it really possible that ~12,500 writes per second (per node in a 12-node DC) are caused by memory flushes?
> > >
> > > On Thu, Nov 10, 2016 at 11:02 AM, Vladimir Yudovin <vla...@winguzone.com> wrote:
> > > > Hi Shalom,
> > > >
> > > > I'm not so sure, but probably excessive memory consumption by this SELECT causes C* to flush tables to free memory.
> > > >
> > > > On Thu, 10 Nov 2016 03:36:59 -0500, Shalom Sagges <shal...@liveperson.com> wrote:
> > > > > Hi there!
> > > > >
> > > > > I'm using C* 2.0.14.
> > > > > I experienced a scenario where a "select count(*)" that ran every minute on a table with practically no results limit (yes, this should definitely be avoided) caused a huge increase in Cassandra writes, to around 150 thousand writes per second for that particular table.
> > > > >
> > > > > Can anyone explain this behavior? Why would a SELECT query significantly increase the write count in Cassandra?
> > > > >
> > > > > Thanks!

--
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
Re: Cassandra reaper
Running Reaper with INFO-level logging (which can be configured in the yaml file), you should have console output telling you what's going on.
If you started Reaper with the memory backend, restarting it will reset it and you'll have to register your cluster again, but if you used Postgres it will resume tasks where they were left off. Please restart Reaper so we at least have output we can get information from; otherwise we're blind.
Since you're using Cassandra 2.1, I'd advise switching to our fork, since the original one is compiled against Cassandra 2.0 libraries. If you switch and use Postgres, make sure you update the schema accordingly, as we added fields for incremental repair support.

Cheers,

On Tue, Nov 1, 2016 at 6:31 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
> Cassandra version is 2.1.16
>
> In my setup I don't see it writing to any logs
>
> On Tue, Nov 1, 2016 at 10:25 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
> > Do you have anything in the reaper logs that would show a failure of some sort?
> > Also, can you tell me which version of Cassandra you're using?
> >
> > Thanks
> >
> > On Tue, Nov 1, 2016 at 6:15 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
> > > Thanks Alex,
> > >
> > > Forgot to mention, but I did add the cluster. See the status below. It says the status is running, but I don't see any repair happening. It has been in this state for the past day.
> > > B/w, there's not much data in the cluster.
> > >
> > > [root@machine cassandra-reaper]# ./bin/spreaper status-repair 3
> > > # Report improvements/bugs at https://github.com/spotify/cassandra-reaper/issues
> > > # --
> > > # Repair run with id '3':
> > > {
> > >   "cause": "manual spreaper run",
> > >   "cluster_name": "production",
> > >   "column_families": [],
> > >   "creation_time": "2016-11-01T00:39:15Z",
> > >   "duration": null,
> > >   "end_time": null,
> > >   "estimated_time_of_arrival": null,
> > >   "id": 3,
> > >   "intensity": 0.900,
> > >   "keyspace_name": "users",
> > >   "last_event": "no events",
> > >   "owner": "root",
> > >   "pause_time": null,
> > >   "repair_parallelism": "DATACENTER_AWARE",
> > >   "segments_repaired": 0,
> > >   "start_time": "2016-11-01T00:39:15Z",
> > >   "state": "RUNNING",
> > >   "total_segments": 301
> > > }
> > > [root@machine cassandra-reaper]#
Re: Cassandra reaper
Do you have anything in the reaper logs that would show a failure of some sort?
Also, can you tell me which version of Cassandra you're using?

Thanks

On Tue, Nov 1, 2016 at 6:15 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
> Thanks Alex,
>
> Forgot to mention, but I did add the cluster. See the status below. It says the status is running, but I don't see any repair happening. It has been in this state for the past day.
> B/w, there's not much data in the cluster.
Re: Cassandra reaper
Hi,

The first step in using Reaper is to add a cluster to it, as it is a tool that can manage multiple clusters and does not need to be executed on a Cassandra node (you can run it on any edge node you want).

You should run: ./bin/spreaper add-cluster 127.0.0.1
where you'll replace 127.0.0.1 with the address of one of the nodes of your cluster.

Then you can run: ./bin/spreaper repair cluster_name keyspace_name
to start repairing a keyspace.

You might want to drop in the UI made by Stefan Podkowinski, which might ease things up for you, at least at the beginning: https://github.com/spodkowinski/cassandra-reaper-ui

Worth mentioning that at The Last Pickle we maintain a fork of Reaper that handles incremental repair, works with C* 2.x and 3.0, and bundles the UI: https://github.com/thelastpickle/cassandra-reaper
We have a branch that allows using Cassandra as a storage backend instead of Postgres: https://github.com/thelastpickle/cassandra-reaper/tree/add-cassandra-storage
It should be merged to master really soon and should be ready to use.

Cheers,

On Tue, Nov 1, 2016 at 1:45 AM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
> Hello,
>
> Has anyone played around with the cassandra reaper (https://github.com/spotify/cassandra-reaper)?
>
> If so, can someone please help me with the set-up? I can't get it working. I used the below steps:
>
> 1. create the jar file using maven
> 2. java -jar cassandra-reaper-0.2.3-SNAPSHOT.jar server cassandra-reaper.yaml
> 3. ./bin/spreaper repair production users

--
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
Re: Tools to manage repairs
Hi Eric,

that would be https://issues.apache.org/jira/browse/CASSANDRA-9754 by Michael Kjellman and https://issues.apache.org/jira/browse/CASSANDRA-11206 by Robert Stupp.

If you haven't seen it yet, Robert's summit talk on big partitions is totally worth it:
Video: https://www.youtube.com/watch?v=N3mGxgnUiRY
Slides: http://www.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016

Cheers,

On Fri, Oct 28, 2016 at 4:09 PM Eric Evans <john.eric.ev...@gmail.com> wrote:
> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
> > A few patches are pushing the limits of partition sizes, so we may soon be more comfortable with big partitions.
>
> You don't happen to have Jira links to these handy, do you?
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
Re: Tools to manage repairs
The "official" recommendation would be 100MB, but it's hard to give a precise answer. Keeping it under a GB seems like a good target.
A few patches are pushing the limits of partition sizes, so we may soon be more comfortable with big partitions.

Cheers

On Thu, Oct 27, 2016 at 9:28 PM, Vincent Rischmann <m...@vrischmann.me> wrote:
> Yeah, that particular table is badly designed. I intend to fix it, when the roadmap allows us to do it :)
> What is the recommended maximum partition size?
>
> Thanks for all the information.
>
> On Thu, Oct 27, 2016, at 08:14 PM, Alexander Dejanovski wrote:
> > 3.3GB is already too high, and it surely doesn't help compactions perform well. I know changing a data model is no easy thing to do, but you should try to do something here.
> >
> > Anticompaction is a special type of compaction, and if an sstable is being anticompacted, then any attempt to run a validation compaction on it will fail, telling you that you cannot have an sstable be part of 2 repair sessions at the same time. So incremental repair must be run one node at a time, waiting for anticompactions to end before moving from one node to the other.
> >
> > Be mindful of running incremental repair on a regular basis once you've started, as you'll have two separate pools of sstables (repaired and unrepaired) that won't get compacted together, which could be a problem if you want tombstones to be purged efficiently.
> >
> > Cheers,
> >
> > On Thu, Oct 27, 2016 at 5:57 PM, Vincent Rischmann <m...@vrischmann.me> wrote:
> > > Ok, I think we'll give incremental repairs a try on a limited number of CFs first, and then if it goes well we'll progressively switch more CFs to incremental.
> > >
> > > I'm not sure I understand the problem with anticompaction and validation running concurrently. As far as I can tell, right now when a CF is repaired (either via reaper or via nodetool) there may be compactions running at the same time. In fact, it happens very often.
> > > Is it a problem?
> > >
> > > As for big partitions, the biggest one we have is around 3.3GB. Some less big partitions are around 500MB and less.
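The partition-size guidance in this exchange (~100MB recommended, keep it under a GB, 3.3GB clearly too high) can be turned into a quick triage helper. A hedged sketch: the table names and byte figures below are made up, and real maximums would come from nodetool cfhistograms:

```python
# Rough triage of "partition maximum bytes" figures against the thresholds
# discussed in this thread. All names and sizes here are illustrative.

RECOMMENDED = 100 * 1024**2   # ~100MB "official" recommendation
HARD_TARGET = 1024**3         # "keeping it under the GB"

def classify(max_partition_bytes):
    if max_partition_bytes > HARD_TARGET:
        return "too big"
    if max_partition_bytes > RECOMMENDED:
        return "above recommendation"
    return "ok"

sizes = {
    "events_by_user": int(3.3 * 1024**3),  # the 3.3GB partition from the thread
    "sessions": 500 * 1024**2,             # the ~500MB partitions
    "users": 40 * 1024**2,
}
for table, size in sizes.items():
    print(table, "->", classify(size))
```

Partitions in the "too big" bucket are exactly the ones likely to make validation compactions, and therefore repair segments, stall.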
Re: Tools to manage repairs
3.3GB is already too high, and it surely doesn't help compactions perform well. I know changing a data model is no easy thing to do, but you should try to do something here.

Anticompaction is a special type of compaction, and if an sstable is being anticompacted, then any attempt to run a validation compaction on it will fail, telling you that you cannot have an sstable be part of 2 repair sessions at the same time. So incremental repair must be run one node at a time, waiting for anticompactions to end before moving from one node to the other.

Be mindful of running incremental repair on a regular basis once you've started, as you'll have two separate pools of sstables (repaired and unrepaired) that won't get compacted together, which could be a problem if you want tombstones to be purged efficiently.

Cheers,

On Thu, Oct 27, 2016 at 5:57 PM, Vincent Rischmann <m...@vrischmann.me> wrote:
> Ok, I think we'll give incremental repairs a try on a limited number of CFs first, and then if it goes well we'll progressively switch more CFs to incremental.
>
> I'm not sure I understand the problem with anticompaction and validation running concurrently. As far as I can tell, right now when a CF is repaired (either via reaper or via nodetool) there may be compactions running at the same time. In fact, it happens very often. Is it a problem?
>
> As for big partitions, the biggest one we have is around 3.3GB. Some less big partitions are around 500MB and less.
> > You can definitely pick which CF you'll run incremental repair on, and still run full repair on the rest.
> > If you pick our Reaper fork, watch out for schema changes that add incremental repair fields, and I do not advise running incremental repair without it; otherwise you might have issues with anticompaction and validation compactions running concurrently from time to time.
> >
> > One last thing: can you check whether you have particularly big partitions in the CFs that fail to get repaired? You can run nodetool cfhistograms to check that.
> >
> > Cheers,
> >
> > On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <m...@vrischmann.me> wrote:
> > > Thanks for the response.
> > >
> > > We do break up repairs between tables; we also tried our best to have no overlap between repair runs. Each repair has 1 segments (a purely arbitrary number that seemed to help at the time). Some runs have an intensity of 0.4, some as low as 0.05.
> > >
> > > Still, sometimes one particular app (which does a lot of read/modify/write batches at quorum) gets slowed down to the point we have to stop the repair run.
> > >
> > > But more annoyingly, for the past 2 to 3 weeks as I said, it looks like runs don't progress after some time. Every time I restart reaper, it starts to repair correctly again, up until it gets stuck. I have no idea why that happens now, but it means I have to babysit reaper, and it's becoming annoying.
> > >
> > > Thanks for the suggestion about incremental repairs. It would probably be a good thing, but it's a little challenging to set up I think. Right now, running a full repair of all keyspaces (via nodetool repair) is going to take a lot of time, probably 5 days or more. We were never able to run one to completion. I'm not sure it's a good idea to disable autocompaction for that long.
> > >
> > > But maybe I'm wrong. Is it possible to use incremental repairs on some column families only?
> > >
> > > On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
> > > > Hi Vincent,
> > > >
> > > > most people handle repair with:
> > > > - pain (by hand, running nodetool commands)
> > > > - cassandra_range_repair: https://github.com/BrianGallew/cassandra_range_repair
> > > > - Spotify Reaper
> > > > - and the OpsCenter repair service for DSE users
> > > >
> > > > Reaper is a good option I think and you should stick to it. If it cannot do the job here then no other tool will.
> > > >
> > > > You have several options from here:
> > > > - Try to break up your repair table by table and see which ones actually get stuck
> > > > - Check your logs for any repair/streaming error
> > > > - Avoid repairing everything:
> > > >   - you may have expendable tables
> > > >   - you may have TTLed-only tables with no deletes, accessed with QUORUM CL only
> > > > - You can try to r
Re: Tools to manage repairs
Oh right, that's what they advise :) I'd say that you should skip the full repair phase in the migration procedure as that will obviously fail, and just mark all sstables as repaired (skip 1, 2 and 6). Anyway you can't do better, so take a leap of faith there. Intensity is already very low and 1 segments is a whole lot for 9 nodes, you should not need that many. You can definitely pick which CF you'll run incremental repair on, and still run full repair on the rest. If you pick our Reaper fork, watch out for schema changes that add incremental repair fields, and I do not advise to run incremental repair without it, otherwise you might have issues with anticompaction and validation compactions running concurrently from time to time. One last thing : can you check if you have particularly big partitions in the CFs that fail to get repaired ? You can run nodetool cfhistograms to check that. Cheers, On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <m...@vrischmann.me> wrote: > Thanks for the response. > > We do break up repairs between tables, we also tried our best to have no > overlap between repair runs. Each repair has 1 segments (purely > arbitrary number, seemed to help at the time). Some runs have an intensity > of 0.4, some have as low as 0.05. > > Still, sometimes one particular app (which does a lot of read/modify/write > batches in quorum) gets slowed down to the point we have to stop the repair > run. > > But more annoyingly, since 2 to 3 weeks as I said, it looks like runs > don't progress after some time. Every time I restart reaper, it starts to > repair correctly again, up until it gets stuck. I have no idea why that > happens now, but it means I have to baby sit reaper, and it's becoming > annoying. > > Thanks for the suggestion about incremental repairs. It would probably be > a good thing but it's a little challenging to setup I think. 
Right now > running a full repair of all keyspaces (via nodetool repair) is going to > take a lot of time, probably like 5 days or more. We were never able to run > one to completion. I'm not sure it's a good idea to disable autocompaction > for that long. > > But maybe I'm wrong. Is it possible to use incremental repairs on some > column family only ? > > > On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote: > > Hi Vincent, > > most people handle repair with : > - pain (by hand running nodetool commands) > - cassandra range repair : > https://github.com/BrianGallew/cassandra_range_repair > - Spotify Reaper > - and OpsCenter repair service for DSE users > > Reaper is a good option I think and you should stick to it. If it cannot > do the job here then no other tool will. > > You have several options from here : > >- Try to break up your repair table by table and see which ones >actually get stuck >- Check your logs for any repair/streaming error >- Avoid repairing everything : >- you may have expendable tables > - you may have TTLed only tables with no deletes, accessed with > QUORUM CL only > - You can try to relieve repair pressure in Reaper by lowering >repair intensity (on the tables that get stuck) >- You can try adding steps to your repair process by putting a higher >segment count in reaper (on the tables that get stuck) >- And lastly, you can turn to incremental repair. As you're familiar >with Reaper already, you might want to take a look at our Reaper fork that >handles incremental repair : >https://github.com/thelastpickle/cassandra-reaper >If you go down that way, make sure you first mark all sstables as >repaired before you run your first incremental repair, otherwise you'll end >up in anticompaction hell (bad bad place) : > > https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html >Even if people say that's not necessary anymore, it'll save you from a >very bad first experience with incremental repair. 
>Furthermore, make sure you run repair daily after your first inc >repair run, in order to work on small sized repairs. > > > Cheers, > > > On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <m...@vrischmann.me> > wrote: > > > Hi, > > we have two Cassandra 2.1.15 clusters at work and are having some trouble > with repairs. > > Each cluster has 9 nodes, and the amount of data is not gigantic but some > column families have 300+Gb of data. > We tried to use `nodetool repair` for these tables but at the time we > tested it, it made the whole cluster load too much and it impacted our > production apps. > > Next we saw https://github.com/spotify/cassandra-reaper , tried it and > had some success until recen
Re: Tools to manage repairs
Hi Vincent, most people handle repair with : - pain (by hand running nodetool commands) - cassandra range repair : https://github.com/BrianGallew/cassandra_range_repair - Spotify Reaper - and OpsCenter repair service for DSE users Reaper is a good option I think and you should stick to it. If it cannot do the job here then no other tool will. You have several options from here : - Try to break up your repair table by table and see which ones actually get stuck - Check your logs for any repair/streaming error - Avoid repairing everything : - you may have expendable tables - you may have TTLed only tables with no deletes, accessed with QUORUM CL only - You can try to relieve repair pressure in Reaper by lowering repair intensity (on the tables that get stuck) - You can try adding steps to your repair process by putting a higher segment count in reaper (on the tables that get stuck) - And lastly, you can turn to incremental repair. As you're familiar with Reaper already, you might want to take a look at our Reaper fork that handles incremental repair : https://github.com/thelastpickle/cassandra-reaper If you go down that way, make sure you first mark all sstables as repaired before you run your first incremental repair, otherwise you'll end up in anticompaction hell (bad bad place) : https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html Even if people say that's not necessary anymore, it'll save you from a very bad first experience with incremental repair. Furthermore, make sure you run repair daily after your first inc repair run, in order to work on small sized repairs. Cheers, On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <m...@vrischmann.me> wrote: Hi, we have two Cassandra 2.1.15 clusters at work and are having some trouble with repairs. Each cluster has 9 nodes, and the amount of data is not gigantic but some column families have 300+Gb of data. 
We tried to use `nodetool repair` for these tables but at the time we tested it, it made the whole cluster load too much and it impacted our production apps. Next we saw https://github.com/spotify/cassandra-reaper , tried it and had some success until recently. Since 2 to 3 weeks it never completes a repair run, deadlocking itself somehow. I know DSE includes a repair service but I'm wondering how do other Cassandra users manage repairs ? Vincent. -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
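The subrange approach Reaper uses (repairing the ring as many small segments rather than one big `nodetool repair`) can be pictured with a short sketch. This is an illustration only, not Reaper's actual implementation; the segment count of 256 is an arbitrary example, and the token bounds assume Murmur3Partitioner:

```python
# Sketch: split the Murmur3Partitioner token ring into N contiguous
# segments; each (start, end) pair would back one subrange repair
# (conceptually, nodetool repair -st <start> -et <end>).
MIN_TOKEN = -(2 ** 63)       # Murmur3Partitioner minimum token
MAX_TOKEN = 2 ** 63 - 1      # Murmur3Partitioner maximum token

def ring_segments(n):
    """Return n (start, end) token pairs covering the full ring."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    segments = []
    start = MIN_TOKEN
    for i in range(n):
        # Last segment absorbs the rounding remainder so the ring is covered.
        end = MAX_TOKEN if i == n - 1 else start + step
        segments.append((start, end))
        start = end
    return segments

segments = ring_segments(256)
print(len(segments))  # 256 segments covering MIN_TOKEN..MAX_TOKEN
```

Smaller segments mean smaller Merkle trees, less overstreaming, and a cheaper retry when one segment fails, which is why many small segments behave better than a single full-range repair on dense nodes.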
Re: incremental repairs with -pr flag?
Hi Sean, In order to mitigate its impact, anticompaction is not fully executed when incremental repair is run with -pr. What you'll observe is that running repair on all nodes with -pr will leave sstables marked as unrepaired on all of them. Then, if you think about it, you realize it's no big deal, as -pr is useless with incremental repair: data is repaired only once with incremental repair, which is what -pr was intended to fix for full repair, by repairing each token range only once instead of as many times as the replication factor. Cheers, On Mon, Oct 24, 2016 at 18:05, Sean Bridges <sean.brid...@globalrelay.net> wrote: > Hey, > > In the datastax documentation on repair [1], it says, > > "The partitioner range option is recommended for routine maintenance. Do > not use it to repair a downed node. Do not use with incremental repair > (default for Cassandra 3.0 and later)." > > Why is it not recommended to use -pr with incremental repairs? > > Thanks, > > Sean > > [1] > https://docs.datastax.com/en/cassandra/3.x/cassandra/operations/opsRepairNodesManualRepair.html > -- > > Sean Bridges > > senior systems architect > Global Relay > > *sean.brid...@globalrelay.net* <sean.brid...@globalrelay.net> 
-- Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
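The point about -pr can be checked with a toy model. Assumptions (mine, purely illustrative and not Cassandra code): a simple 6-node ring, RF=3, SimpleStrategy-style placement where node i replicates its own primary range plus the two preceding ones:

```python
# Toy model: count how many times each token range gets repaired across
# a full cluster repair, with and without -pr.
from collections import Counter

NODES, RF = 6, 3

def replicated_ranges(node):
    # Node i replicates primary ranges i, i-1, i-2 (RF=3, simple ring).
    return [(node - k) % NODES for k in range(RF)]

# Plain full repair on every node: each node repairs all ranges it replicates.
without_pr = Counter(r for n in range(NODES) for r in replicated_ranges(n))
# Repair with -pr on every node: each node repairs only its primary range.
with_pr = Counter(n for n in range(NODES))

print(set(without_pr.values()))  # {3}: every range repaired RF times
print(set(with_pr.values()))     # {1}: every range repaired exactly once
```

Which is exactly the redundancy -pr removes for full repair, and why it is pointless for incremental repair, where already-repaired data is skipped anyway.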
Re: non incremental repairs with cassandra 2.2+
Hi Kurt, we're not actually. Reaper performs full repair by subrange but does incremental repair on all ranges at once, node by node. Subrange is incompatible with incremental repair anyway. Cheers, On Thu, Oct 20, 2016 at 5:24 AM kurt Greaves <k...@instaclustr.com> wrote: > > On 19 October 2016 at 17:13, Alexander Dejanovski <a...@thelastpickle.com> > wrote: > > There aren't that many tools I know to orchestrate repairs and we maintain > a fork of Reaper, that was made by Spotify, and handles incremental repair > : https://github.com/thelastpickle/cassandra-reaper > > > Looks like you're using subranges with incremental repairs. This will > generate a lot of anticompactions as you'll only repair a portion of the > SSTables. You should use forceRepairAsync for incremental repairs so that > it's possible for the repair to act on the whole SSTable, minimising > anticompactions. > > Kurt Greaves > k...@instaclustr.com > www.instaclustr.com > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: non incremental repairs with cassandra 2.2+
There aren't that many tools I know of to orchestrate repairs, and we maintain a fork of Reaper, which was made by Spotify and handles incremental repair: https://github.com/thelastpickle/cassandra-reaper We just added Cassandra as a storage back end (only Postgres currently) in one of the branches, which should soon be merged to master. On Wed, Oct 19, 2016 at 19:03, Kant Kodali <k...@peernova.com> wrote: Also any suggestions on a tool to orchestrate the incremental repair? Like say most commonly used Sent from my iPhone On Oct 19, 2016, at 9:54 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote: Hi Kant, subrange is a form of full repair, so it will just split the repair process in smaller yet sequential pieces of work (repair is started giving a start and end token). Overall, you should not expect improvements other than having less overstreaming and better chances of success if your cluster is dense. You can try to use incremental repair if you know what the caveats are and use a proper tool to orchestrate it, that would save you from repairing all 10TB each time. CASSANDRA-12580 might help too as Romain showed us : https://www.mail-archive.com/user@cassandra.apache.org/msg49344.html Cheers, On Wed, Oct 19, 2016 at 6:42 PM Kant Kodali <k...@peernova.com> wrote: Another question on a same note would be what would be the fastest way to do repairs of size 10TB cluster ? Full repairs are taking days. So among repair parallel or repair sub range which is faster in the case of say adding a new node to the cluster? Sent from my iPhone On Oct 19, 2016, at 9:30 AM, Sean Bridges <sean.brid...@globalrelay.net> wrote: Hey, We are upgrading from cassandra 2.1 to cassandra 2.2. With cassandra 2.1 we would periodically repair all nodes, using the -pr flag. With cassandra 2.2, the same repair takes a very long time, as cassandra does an anti compaction after the repair. This anti compaction causes most (all?) the sstables to be rewritten. 
Is there a way to do full repairs without continually anti compacting? If we do a full repair on each node with the -pr flag, will subsequent full repairs also force anti compacting most (all?) sstables? Thanks, Sean -- --------- Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com -- --------- Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: non incremental repairs with cassandra 2.2+
Can you explain why you would want to run repair for new nodes? Aren't you talking about bootstrap, which is not related to repair actually? On Wed, Oct 19, 2016 at 18:57, Kant Kodali <k...@peernova.com> wrote: > Thanks! How do I do an incremental repair when I add a new node? > > Sent from my iPhone > > On Oct 19, 2016, at 9:54 AM, Alexander Dejanovski <a...@thelastpickle.com> > wrote: > > Hi Kant, > > subrange is a form of full repair, so it will just split the repair > process in smaller yet sequential pieces of work (repair is started giving > a start and end token). Overall, you should not expect improvements other > than having less overstreaming and better chances of success if your > cluster is dense. > > You can try to use incremental repair if you know what the caveats are and > use a proper tool to orchestrate it, that would save you from repairing all > 10TB each time. > CASSANDRA-12580 might help too as Romain showed us : > https://www.mail-archive.com/user@cassandra.apache.org/msg49344.html > > Cheers, > > > > On Wed, Oct 19, 2016 at 6:42 PM Kant Kodali <k...@peernova.com> wrote: > > Another question on a same note would be what would be the fastest way to > do repairs of size 10TB cluster ? Full repairs are taking days. So among > repair parallel or repair sub range which is faster in the case of say > adding a new node to the cluster? > > Sent from my iPhone > > On Oct 19, 2016, at 9:30 AM, Sean Bridges <sean.brid...@globalrelay.net> > wrote: > > Hey, > > We are upgrading from cassandra 2.1 to cassandra 2.2. > > With cassandra 2.1 we would periodically repair all nodes, using the -pr > flag. > > With cassandra 2.2, the same repair takes a very long time, as cassandra > does an anti compaction after the repair. This anti compaction causes most > (all?) the sstables to be rewritten. Is there a way to do full repairs > without continually anti compacting? 
If we do a full repair on each node > with the -pr flag, will subsequent full repairs also force anti compacting > most (all?) sstables? > > Thanks, > > Sean > > -- > - > Alexander Dejanovski > France > @alexanderdeja > > Consultant > Apache Cassandra Consulting > http://www.thelastpickle.com > > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: non incremental repairs with cassandra 2.2+
Hi Kant, subrange is a form of full repair, so it will just split the repair process in smaller yet sequential pieces of work (repair is started giving a start and end token). Overall, you should not expect improvements other than having less overstreaming and better chances of success if your cluster is dense. You can try to use incremental repair if you know what the caveats are and use a proper tool to orchestrate it, that would save you from repairing all 10TB each time. CASSANDRA-12580 might help too as Romain showed us : https://www.mail-archive.com/user@cassandra.apache.org/msg49344.html Cheers, On Wed, Oct 19, 2016 at 6:42 PM Kant Kodali <k...@peernova.com> wrote: Another question on a same note would be what would be the fastest way to do repairs of size 10TB cluster ? Full repairs are taking days. So among repair parallel or repair sub range which is faster in the case of say adding a new node to the cluster? Sent from my iPhone On Oct 19, 2016, at 9:30 AM, Sean Bridges <sean.brid...@globalrelay.net> wrote: Hey, We are upgrading from cassandra 2.1 to cassandra 2.2. With cassandra 2.1 we would periodically repair all nodes, using the -pr flag. With cassandra 2.2, the same repair takes a very long time, as cassandra does an anti compaction after the repair. This anti compaction causes most (all?) the sstables to be rewritten. Is there a way to do full repairs without continually anti compacting? If we do a full repair on each node with the -pr flag, will subsequent full repairs also force anti compacting most (all?) sstables? Thanks, Sean -- ----- Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: non incremental repairs with cassandra 2.2+
Hi Sean, you should be able to do that by running subrange repairs, which is the only type of repair that wouldn't trigger anticompaction AFAIK. Beware that now you will have sstables marked as repaired and others marked as unrepaired, which will never be compacted together. You might want to flag all sstables as unrepaired before moving on, if you do not intend to switch to incremental repair for now. Cheers, On Wed, Oct 19, 2016 at 6:31 PM Sean Bridges <sean.brid...@globalrelay.net> wrote: > Hey, > > We are upgrading from cassandra 2.1 to cassandra 2.2. > > With cassandra 2.1 we would periodically repair all nodes, using the -pr > flag. > > With cassandra 2.2, the same repair takes a very long time, as cassandra > does an anti compaction after the repair. This anti compaction causes most > (all?) the sstables to be rewritten. Is there a way to do full repairs > without continually anti compacting? If we do a full repair on each node > with the -pr flag, will subsequent full repairs also force anti compacting > most (all?) sstables? > > Thanks, > > Sean > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
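The repaired/unrepaired split mentioned above can be pictured with a small sketch (an illustration of the concept, not Cassandra's code): each sstable carries a repairedAt timestamp, zero meaning unrepaired, and compaction only ever groups sstables from the same pool, which is why the two sets never get compacted together:

```python
# Sketch: partition sstables into the two compaction pools by their
# repairedAt marker (0 = unrepaired, anything else = repaired).
def compaction_pools(sstables):
    """sstables: list of (name, repaired_at). Returns (repaired, unrepaired)."""
    repaired = [s for s in sstables if s[1] != 0]
    unrepaired = [s for s in sstables if s[1] == 0]
    return repaired, unrepaired

tables = [("a-1", 1477000000), ("a-2", 0), ("a-3", 0), ("a-4", 1477001000)]
rep, unrep = compaction_pools(tables)
print([n for n, _ in rep])    # pool compacted only among itself
print([n for n, _ in unrep])  # separate pool, never mixed with the first
```

This is why a tombstone sitting in one pool cannot purge shadowed data sitting in the other, and why flagging everything back to unrepaired (the migration docs describe the sstablerepairedset tool for this, with the node stopped) is advised if you abandon incremental repair.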
Re: problem starting incremental repair using TheLastPicke Reaper
Abhishek, can you file an issue on our github repo so that we can further discuss this ? https://github.com/thelastpickle/cassandra-reaper/issues Thanks, On Wed, Oct 19, 2016 at 1:20 PM Abhishek Aggarwal < abhishek.aggarwa...@snapdeal.com> wrote: > Hi Alex, > > that i already did and it worked but my question is if the passed value of > incremental repair flag is different from the existing value then it > should allow to create new repair_unit instead of getting repair_unit based > on cluster name / keyspace / column combination. > > and also if i delete the repair_unit then due to referential constraints i > need to delete repair_segment and repair_run as well which will delete the > run history corresponding to the repair_unit. > > Abhishek Aggarwal > > *Senior Software Engineer* > *M*: +91 8861212073 , 8588840304 > *T*: 0124 6600600 *EXT*: 12128 > ASF Center - Tower A, Udyog Vihar Phase IV > > On Wed, Oct 19, 2016 at 4:44 PM, Alexander Dejanovski < > a...@thelastpickle.com> wrote: > > Hi Abhishek, > > This shows you have two repair units for the same keyspace/table with > different incremental repair settings. > Can you delete your prior repair run (the one with incremental repair set > to false) and then create the new one with incremental repair set to true ? > > Let me know how that works, > > > On Wed, Oct 19, 2016 at 10:45 AM Abhishek Aggarwal < > abhishek.aggarwa...@snapdeal.com> wrote: > > > is there a way to start the incremental repair using the reaper. 
we > completed full repair successfully and after that i tried to run the > incremental run but getting the below error. > > > A repair run already exist for the same cluster/keyspace/table but with a > different incremental repair value.Requested value: true | Existing value: > false > > > -- > - > Alexander Dejanovski > France > @alexanderdeja > > Consultant > Apache Cassandra Consulting > http://www.thelastpickle.com > > > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: problem starting incremental repair using TheLastPicke Reaper
Hi Abhishek, This shows you have two repair units for the same keyspace/table with different incremental repair settings. Can you delete your prior repair run (the one with incremental repair set to false) and then create the new one with incremental repair set to true ? Let me know how that works, On Wed, Oct 19, 2016 at 10:45 AM Abhishek Aggarwal < abhishek.aggarwa...@snapdeal.com> wrote: > > is there a way to start the incremental repair using the reaper. we > completed full repair successfully and after that i tried to run the > incremental run but getting the below error. > > > A repair run already exist for the same cluster/keyspace/table but with a > different incremental repair value.Requested value: true | Existing value: > false > > > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: [cassandra 3.6.] Nodetool Repair + tombstone behaviour
Atul, our fork has been tested on 2.1 and 3.0.x clusters. I've just tested with a CCM 3.6 cluster and it worked with no issue. With Reaper, if you set incremental to false, it'll perform a full subrange repair with no anticompaction. You'll see this message in the logs : INFO [AntiEntropyStage:1] 2016-09-29 16:11:34,950 ActiveRepairService.java:378 - Not a global repair, will not do anticompaction If you set incremental to true, it'll perform an incremental repair, one node at a time, with anticompaction (set Parallelism to Parallel exclusively with inc repair). Let me know how it goes. On Thu, Sep 29, 2016 at 3:06 PM Atul Saroha <atul.sar...@snapdeal.com> wrote: > Hi Alexander, > > There is compatibility issue raised with spotify/cassandra-reaper for > cassandra version 3.x. > Is it comaptible with 3.6 in fork thelastpickle/cassandra-reaper ? > > There are some suggestions mentioned by *brstgt* which we can try on our > side. > > On Thu, Sep 29, 2016 at 5:42 PM, Atul Saroha <atul.sar...@snapdeal.com> > wrote: > >> Thanks Alexander. >> >> Will look into all these. >> >> On Thu, Sep 29, 2016 at 4:39 PM, Alexander Dejanovski < >> a...@thelastpickle.com> wrote: >> >>> Atul, >>> >>> since you're using 3.6, by default you're running incremental repair, >>> which doesn't like concurrency very much. >>> Validation errors are not occurring on a partition or partition range >>> base, but if you're trying to run both anticompaction and validation >>> compaction on the same SSTable. >>> >>> Like advised to Robert yesterday, and if you want to keep on running >>> incremental repair, I'd suggest the following : >>> >>>- run nodetool tpstats on all nodes in search for running/pending >>>repair sessions >>>- If you have some, and to be sure you will avoid conflicts, roll >>>restart your cluster (all nodes) >>>- Then, run "nodetool repair" on one node. 
>>>- When repair has finished on this node (track messages in the log >>>and nodetool tpstats), check if other nodes are running anticompactions >>>- If so, wait until they are over >>>- If not, move on to the other node >>> >>> You should be able to run concurrent incremental compactions on >>> different tables if you wish to speed up the complete repair of the >>> cluster, but do not try to repair the same table/full keyspace from two >>> nodes at the same time. >>> >>> If you do not want to keep on using incremental repair, and fallback to >>> classic full repair, I think the only way in 3.6 to avoid anticompaction >>> will be to use subrange repair (Paulo mentioned that in 3.x full repair >>> also triggers anticompaction). >>> >>> You have two options here : cassandra_range_repair ( >>> https://github.com/BrianGallew/cassandra_range_repair) and Spotify >>> Reaper (https://github.com/spotify/cassandra-reaper) >>> >>> cassandra_range_repair might scream about subrange + incremental not >>> being compatible (not sure here), but you can modify the repair_range() >>> method by adding a --full switch to the command line used to run repair. >>> >>> We have a fork of Reaper that handles both full subrange repair and >>> incremental repair here : >>> https://github.com/thelastpickle/cassandra-reaper >>> It comes with a tweaked version of the UI made by Stephan Podkowinski ( >>> https://github.com/spodkowinski/cassandra-reaper-ui) - that eases >>> interactions to schedule, run and track repair - which adds fields to run >>> incremental repair (accessible via ...:8080/webui/ in your browser). >>> >>> Cheers, >>> >>> >>> >>> On Thu, Sep 29, 2016 at 12:33 PM Atul Saroha <atul.sar...@snapdeal.com> >>> wrote: >>> >>>> Hi, >>>> >>>> We are not sure whether this issue is linked to that node or not. Our >>>> application does frequent delete and insert. >>>> >>>> May be our approach is not correct for nodetool repair. Yes, we >>>> generally fire repair on all boxes at same time. 
Till now, it was manual >>>> with default configuration ( command: "nodetool repair"). >>>> Yes, we saw validation error but that is linked to already running >>>> repair of same partition on other box for same partition range. We saw >>>> error validation failed with some ip as repair in already running for the >>>> same SSTa
Re: [cassandra 3.6.] Nodetool Repair + tombstone behaviour
Atul, since you're using 3.6, by default you're running incremental repair, which doesn't like concurrency very much. Validation errors are not occurring on a partition or partition range base, but if you're trying to run both anticompaction and validation compaction on the same SSTable. Like advised to Robert yesterday, and if you want to keep on running incremental repair, I'd suggest the following : - run nodetool tpstats on all nodes in search for running/pending repair sessions - If you have some, and to be sure you will avoid conflicts, roll restart your cluster (all nodes) - Then, run "nodetool repair" on one node. - When repair has finished on this node (track messages in the log and nodetool tpstats), check if other nodes are running anticompactions - If so, wait until they are over - If not, move on to the other node You should be able to run concurrent incremental compactions on different tables if you wish to speed up the complete repair of the cluster, but do not try to repair the same table/full keyspace from two nodes at the same time. If you do not want to keep on using incremental repair, and fallback to classic full repair, I think the only way in 3.6 to avoid anticompaction will be to use subrange repair (Paulo mentioned that in 3.x full repair also triggers anticompaction). You have two options here : cassandra_range_repair ( https://github.com/BrianGallew/cassandra_range_repair) and Spotify Reaper ( https://github.com/spotify/cassandra-reaper) cassandra_range_repair might scream about subrange + incremental not being compatible (not sure here), but you can modify the repair_range() method by adding a --full switch to the command line used to run repair. 
We have a fork of Reaper that handles both full subrange repair and incremental repair here : https://github.com/thelastpickle/cassandra-reaper It comes with a tweaked version of the UI made by Stephan Podkowinski ( https://github.com/spodkowinski/cassandra-reaper-ui) - that eases interactions to schedule, run and track repair - which adds fields to run incremental repair (accessible via ...:8080/webui/ in your browser). Cheers, On Thu, Sep 29, 2016 at 12:33 PM Atul Saroha <atul.sar...@snapdeal.com> wrote: > Hi, > > We are not sure whether this issue is linked to that node or not. Our > application does frequent delete and insert. > > May be our approach is not correct for nodetool repair. Yes, we generally > fire repair on all boxes at same time. Till now, it was manual with default > configuration ( command: "nodetool repair"). > Yes, we saw validation error but that is linked to already running repair > of same partition on other box for same partition range. We saw error > validation failed with some ip as repair in already running for the same > SSTable. > Just few days back, we had 2 DCs with 3 nodes each and replication was > also 3. It means all data on each node. > > On Thu, Sep 29, 2016 at 2:49 PM, Alexander Dejanovski < > a...@thelastpickle.com> wrote: > >> Hi Atul, >> >> could you be more specific on how you are running repair ? What's the >> precise command line for that, does it run on several nodes at the same >> time, etc... >> What is your gc_grace_seconds ? >> Do you see errors in your logs that would be linked to repairs >> (Validation failure or failure to create a merkle tree)? >> >> You seem to mention a single node that went down but say the whole >> cluster seem to have zombie data. >> What is the connection you see between the node that went down and the >> fact that deleted data comes back to life ? >> What is your strategy for cyclic maintenance repair (schedule, command >> line or tool, etc...) ? 
>> >> Thanks, >> >> On Thu, Sep 29, 2016 at 10:40 AM Atul Saroha <atul.sar...@snapdeal.com> >> wrote: >> >>> Hi, >>> >>> We have seen a weird behaviour in cassandra 3.6. >>> Once our node was went down more than 10 hrs. After that, we had ran >>> Nodetool repair multiple times. But tombstone are not getting sync properly >>> over the cluster. On day- today basis, on expiry of every grace period, >>> deleted records start surfacing again in cassandra. >>> >>> It seems Nodetool repair in not syncing tomebstone across cluster. >>> FYI, we have 3 data centres now. >>> >>> Just want the help how to verify and debug this issue. Help will be >>> appreciated. >>> >>> >>> -- >>> Regards, >>> Atul Saroha >>> >>> *Lead Software Engineer | CAMS* >>> >>> M: +91 8447784271 >>> Plot #362, ASF Center - Tower A, 1st Floor, Sec-18, >>> Udyog Vihar Phase IV,Gurg
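The advice in this thread boils down to checking `nodetool tpstats` for active or pending repair work before moving to the next node. A minimal sketch of that check (assuming the usual whitespace-separated tpstats layout; the exact columns and pool names vary between Cassandra versions, so treat this as a starting point):

```python
# Sketch: scan `nodetool tpstats` output for repair-related pools that
# still have active or pending tasks. Column layout is version dependent.
def busy_repair_pools(tpstats_text):
    busy = []
    for line in tpstats_text.splitlines():
        parts = line.split()
        # Skip headers and non-pool lines (second field must be numeric).
        if len(parts) < 3 or not parts[1].isdigit():
            continue
        name, active, pending = parts[0], int(parts[1]), int(parts[2])
        if ("AntiEntropy" in name or "Repair" in name) and (active or pending):
            busy.append((name, active, pending))
    return busy

# Made-up sample resembling tpstats output, for illustration only.
sample = """Pool Name                    Active   Pending      Completed
MutationStage                     0         0        3837489
AntiEntropyStage                  1         2            145
Repair#1                          0        31              7
ReadStage                         0         0         998231
"""
print(busy_repair_pools(sample))
```

If the function returns a non-empty list for any node, wait (or roll restart, as suggested above) before starting repair on the next one.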
Re: WARN Writing large partition for materialized views
Hi Robert, Materialized Views are regular C* tables underneath, so based on their PK they can generate big partitions. It is often advised to keep partition size under 100MB because larger partitions are hard to read and compact. They usually put pressure on the heap and lead to long GC pauses + laggy compactions. You could possibly OOM while trying to fully read a partition that is way too big for your heap. It is indeed a schema problem and you most likely have to bucket your MV in order to split those partitions into smaller chunks. In the case of MV, you possibly need to add a bucketing field to the table it relies on (if you don't have one already), and add it to the MV partition key. You should try to use cassandra-stress to test your bucket sizes : https://docs.datastax.com/en/cassandra/3.x/cassandra/tools/toolsCStress.html In your schema definition you can now specify the creation of a MV. Cheers, On Wed, Sep 28, 2016 at 7:35 PM Robert Sicoie <robert.sic...@gmail.com> wrote: > Hi guys, > > I run a cluster with 5 nodes, cassandra version 3.0.5. > > I get this warning: > 2016-09-28 17:22:18,480 BigTableWriter.java:171 - Writing large > partition... > > for some materialized views. Some have partitions over 500MB. How does this > affect performance? What can/should be done? I suppose it is a problem in the > schema design. > > Thanks, > Robert Sicoie > -- Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
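A hedged sketch of the bucketing idea described above (the field names, day granularity, and key shape are made-up examples, not a prescription): derive a bucket from an existing column, add it to the partition key, and use the same bucketing function at write and read time so one logical partition spreads across several physical ones:

```python
# Sketch: split a large time-series partition by day. A table keyed by
# ((user_id), ts) conceptually becomes ((user_id, day_bucket), ts).
from datetime import datetime, timezone

def day_bucket(ts):
    """Bucket component of the partition key, e.g. '20160928'."""
    return ts.strftime("%Y%m%d")

ts = datetime(2016, 9, 28, 17, 22, 18, tzinfo=timezone.utc)
pk = ("user-42", day_bucket(ts))  # composite partition key for this write
print(pk)  # ('user-42', '20160928')
```

Reading a time span then means querying each bucket in the range and merging client-side; pick the granularity so each bucket stays comfortably under the ~100MB guideline, and use cassandra-stress to validate the sizing as suggested above.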
Re: [cassandra 3.6.] Nodetool Repair + tombstone behaviour
Hi Atul, could you be more specific about how you are running repair? What's the precise command line, and does it run on several nodes at the same time? What is your gc_grace_seconds? Do you see errors in your logs that would be linked to repairs (validation failure, or failure to create a Merkle tree)? You mention a single node that went down, but say the whole cluster seems to have zombie data. What is the connection you see between the node that went down and the fact that deleted data comes back to life? What is your strategy for recurring maintenance repair (schedule, command line or tool, etc.)? Thanks, On Thu, Sep 29, 2016 at 10:40 AM Atul Saroha <atul.sar...@snapdeal.com> wrote: > Hi, > > We have seen weird behaviour in cassandra 3.6. > One of our nodes was down for more than 10 hrs. After that, we ran > nodetool repair multiple times, but tombstones are not getting synced properly > over the cluster. On a day-to-day basis, on expiry of every grace period, > deleted records start surfacing again in cassandra. > > It seems nodetool repair is not syncing tombstones across the cluster. > FYI, we have 3 data centres now. > > We just want help on how to verify and debug this issue. Help will be > appreciated. > > > -- > Regards, > Atul Saroha > > *Lead Software Engineer | CAMS* > > M: +91 8447784271 > Plot #362, ASF Center - Tower A, 1st Floor, Sec-18, > Udyog Vihar Phase IV, Gurgaon, Haryana, India > > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
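For context, the resurrection mechanism these questions are probing comes down to gc_grace_seconds: a tombstone must reach every replica (via repair) before it becomes purgeable, otherwise a replica that missed the delete can bring the row back to life. A minimal sketch of that timing rule, assuming the default gc_grace_seconds of 10 days (the 10-hour outage figure is taken from the message above):

```python
GC_GRACE_SECONDS = 864_000  # Cassandra default: 10 days

def tombstone_purgeable(deleted_at: int, now: int, gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """A tombstone may be dropped by compaction once gc_grace has elapsed.
    If a replica missed the delete and was not repaired before that point,
    the deleted row can resurface via read repair or regular repair."""
    return now >= deleted_at + gc_grace

# A node down for 10h and repaired within the 10-day window is safe;
# a repair that only succeeds after gc_grace expires allows resurrection.
deleted_at = 0
assert not tombstone_purgeable(deleted_at, now=10 * 3600)   # still protected
assert tombstone_purgeable(deleted_at, now=11 * 24 * 3600)  # past gc_grace
```

This is why the follow-up questions focus on whether repairs actually completed without validation errors: a repair that silently fails on some ranges behaves, for those ranges, like no repair at all.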
Re: How to get rid of "Cannot start multiple repair sessions over the same sstables" exception
Robert, You can restart them in any order, that doesn't make a difference afaik. Cheers Le mer. 28 sept. 2016 17:10, Robert Sicoie <robert.sic...@gmail.com> a écrit : > Thanks Alexander, > > Yes, with tpstats I can see the hanging active repair(s) (output > attached). For one there are 31 pending repair. On others there are less > pending repairs (min 12). Is there any recomandation for the restart order? > The one with more less pending repairs first, perhaps? > > Thanks, > Robert > > Robert Sicoie > > On Wed, Sep 28, 2016 at 5:35 PM, Alexander Dejanovski < > a...@thelastpickle.com> wrote: > >> They will show up in nodetool compactionstats : >> https://issues.apache.org/jira/browse/CASSANDRA-9098 >> >> Did you check nodetool tpstats to see if you didn't have any running >> repair session ? >> Just to make sure (and if you can actually do it), roll restart the >> cluster and try again. Repair sessions can get sticky sometimes. >> >> On Wed, Sep 28, 2016 at 4:23 PM Robert Sicoie <robert.sic...@gmail.com> >> wrote: >> >>> I am using nodetool compactionstats to check for pending compactions and >>> it shows me 0 pending on all nodes, seconds before running nodetool repair. >>> I am also monitoring PendingCompactions on jmx. >>> >>> Is there other way I can find out if is there any anticompaction running >>> on any node? >>> >>> Thanks a lot, >>> Robert >>> >>> Robert Sicoie >>> >>> On Wed, Sep 28, 2016 at 4:44 PM, Alexander Dejanovski < >>> a...@thelastpickle.com> wrote: >>> >>>> Robert, >>>> >>>> you need to make sure you have no repair session currently running on >>>> your cluster, and no anticompaction. >>>> I'd recommend doing a rolling restart in order to stop all running >>>> repair for sure, then start the process again, node by node, checking that >>>> no anticompaction is running before moving from one node to the other. 
>>>> >>>> Please do not use the -pr switch as it is both useless (token ranges >>>> are repaired only once with inc repair, whatever the replication factor) >>>> and harmful as all anticompactions won't be executed (you'll still have >>>> sstables marked as unrepaired even if the process has run entirely with no >>>> error). >>>> >>>> Let us know how that goes. >>>> >>>> Cheers, >>>> >>>> On Wed, Sep 28, 2016 at 2:57 PM Robert Sicoie <robert.sic...@gmail.com> >>>> wrote: >>>> >>>>> Thanks Alexander, >>>>> >>>>> Now I started to run the repair with the -pr arg and with keyspace and >>>>> table args. >>>>> Still, I got the "ERROR [RepairJobTask:1] 2016-09-28 11:34:38,288 >>>>> RepairRunnable.java:246 - Repair session >>>>> 89af4d10-856f-11e6-b28f-df99132d7979 for range >>>>> [(8323429577695061526,8326640819362122791], >>>>> ..., (4212695343340915405,4229348077081465596]]] Validation failed in / >>>>> 10.45.113.88" >>>>> >>>>> for one of the tables. 10.45.113.88 is the IP of the machine I am >>>>> running nodetool on. >>>>> I'm wondering if this is normal... >>>>> >>>>> Thanks, >>>>> Robert >>>>> >>>>> >>>>> >>>>> >>>>> Robert Sicoie >>>>> >>>>> On Wed, Sep 28, 2016 at 11:53 AM, Alexander Dejanovski < >>>>> a...@thelastpickle.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> nodetool scrub won't help here, as what you're experiencing is most >>>>>> likely that one SSTable is going through anticompaction, and then another >>>>>> node is asking for a Merkle tree that involves it. >>>>>> For understandable reasons, an SSTable cannot be anticompacted and >>>>>> validation compacted at the same time. >>>>>> >>>>>> The solution here is to adjust the repair pressure on your cluster so >>>>>> that anticompaction can end before you run repair on another node. >>>>>> You may have a lot of anticompaction to do if you had high volumes of >>>>>> unrepaired data, which can take a long time depending on several factors.
Re: How to get rid of "Cannot start multiple repair sessions over the same sstables" exception
They will show up in nodetool compactionstats : https://issues.apache.org/jira/browse/CASSANDRA-9098 Did you check nodetool tpstats to see if you didn't have any running repair session ? Just to make sure (and if you can actually do it), roll restart the cluster and try again. Repair sessions can get sticky sometimes. On Wed, Sep 28, 2016 at 4:23 PM Robert Sicoie <robert.sic...@gmail.com> wrote: > I am using nodetool compactionstats to check for pending compactions and > it shows me 0 pending on all nodes, seconds before running nodetool repair. > I am also monitoring PendingCompactions on jmx. > > Is there other way I can find out if is there any anticompaction running > on any node? > > Thanks a lot, > Robert > > Robert Sicoie > > On Wed, Sep 28, 2016 at 4:44 PM, Alexander Dejanovski < > a...@thelastpickle.com> wrote: > >> Robert, >> >> you need to make sure you have no repair session currently running on >> your cluster, and no anticompaction. >> I'd recommend doing a rolling restart in order to stop all running repair >> for sure, then start the process again, node by node, checking that no >> anticompaction is running before moving from one node to the other. >> >> Please do not use the -pr switch as it is both useless (token ranges are >> repaired only once with inc repair, whatever the replication factor) and >> harmful as all anticompactions won't be executed (you'll still have >> sstables marked as unrepaired even if the process has ran entirely with no >> error). >> >> Let us know how that goes. >> >> Cheers, >> >> On Wed, Sep 28, 2016 at 2:57 PM Robert Sicoie <robert.sic...@gmail.com> >> wrote: >> >>> Thanks Alexander, >>> >>> Now I started to run the repair with -pr arg and with keyspace and table >>> args. 
>>> Still, I got the "ERROR [RepairJobTask:1] 2016-09-28 11:34:38,288 >>> RepairRunnable.java:246 - Repair session >>> 89af4d10-856f-11e6-b28f-df99132d7979 for range >>> [(8323429577695061526,8326640819362122791], >>> ..., (4212695343340915405,4229348077081465596]]] Validation failed in / >>> 10.45.113.88" >>> >>> for one of the tables. 10.45.113.88 is the ip of the machine I am >>> running the nodetool on. >>> I'm wondering if this is normal... >>> >>> Thanks, >>> Robert >>> >>> >>> >>> >>> Robert Sicoie >>> >>> On Wed, Sep 28, 2016 at 11:53 AM, Alexander Dejanovski < >>> a...@thelastpickle.com> wrote: >>> >>>> Hi, >>>> >>>> nodetool scrub won't help here, as what you're experiencing is most >>>> likely that one SSTable is going through anticompaction, and then another >>>> node is asking for a Merkle tree that involves it. >>>> For understandable reasons, an SSTable cannot be anticompacted and >>>> validation compacted at the same time. >>>> >>>> The solution here is to adjust the repair pressure on your cluster so >>>> that anticompaction can end before you run repair on another node. >>>> You may have a lot of anticompaction to do if you had high volumes of >>>> unrepaired data, which can take a long time depending on several factors. >>>> >>>> You can tune your repair process to make sure no anticompaction is >>>> running before launching a new session on another node or you can try my >>>> Reaper fork that handles incremental repair : >>>> https://github.com/adejanovski/cassandra-reaper/tree/inc-repair-support-with-ui >>>> I may have to add a few checks in order to avoid all collisions between >>>> anticompactions and new sessions, but it should be helpful if you struggle >>>> with incremental repair. >>>> >>>> In any case, check if your nodes are still anticompacting before trying >>>> to run a new repair session on a node. 
>>>> >>>> Cheers, >>>> >>>> >>>> On Wed, Sep 28, 2016 at 10:31 AM Robert Sicoie <robert.sic...@gmail.com> >>>> wrote: >>>> >>>>> Hi guys, >>>>> >>>>> I have a cluster of 5 nodes, cassandra 3.0.5. >>>>> I was running nodetool repair last days, one node at a time, when I >>>>> first encountered this exception >>>>> >>>>> *ERROR [ValidationExecutor:11] 2016-09-27 16:12:20,409 >>>>> CassandraDaemon.java:195 - Exception in thread >>
Re: How to get rid of "Cannot start multiple repair sessions over the same sstables" exception
k.java:56) > ~[apache-cassandra-3.0.5.jar:3.0.5]* > * at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_60]* > * at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > ~[na:1.8.0_60]* > * at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]* > *Caused by: java.lang.InterruptedException: null* > * at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) > ~[na:1.8.0_60]* > * at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) > ~[na:1.8.0_60]* > * at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) > ~[na:1.8.0_60]* > * at > org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:168) > ~[apache-cassandra-3.0.5.jar:3.0.5]* > * ... 6 common frames omitted* > > > Now if I run nodetool repair I get the > > *java.lang.RuntimeException: Cannot start multiple repair sessions over > the same sstables* > > exception. > What do you suggest? would nodetool scrub or sstablescrub help in this > case. or it would just make it worse? > > Thanks, > > Robert > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: race condition for quorum consistency
I haven't been very accurate in my first answer indeed, which was misleading. Apache Cassandra guarantees that if all queries are run at least at quorum, a client writing successfully (as in: the cluster acknowledged the write) then reading his previous write will see the correct value, unless another client updated it between the write and the read (which would be a race condition). The same goes for two different clients if the first issues a successful write and only after that the second reads the value. Quorum provides a consistency guarantee if queries are fired in sequence. Without diving into complex scenarios where it may work because of read repair and the fact that everything is async, Ken's use case was: C1 writes, it is not successful yet, C2 and C3 read at approx. the same time. Once again, in this case C2 and C3 could be reading different values, as C1's mutation could be in a pending state on some nodes. Considering we have nodes A, B and C: - Node A has received the write from C1, nodes B and C have not - C2 reads from A and B, there's a digest mismatch which triggers a foreground read repair (background read repairs are triggered at CL ONE) > it gets the up to date value that was written by C1 - C3 reads from B and C, there's no digest mismatch and the value is not up to date with A > it does not get the value written by C1 Cheers, On Thu, Sep 15, 2016 at 12:10 AM Tyler Hobbs <ty...@datastax.com> wrote: > > On Wed, Sep 14, 2016 at 3:49 PM, Nicolas Douillet < > nicolas.douil...@gmail.com> wrote: > >> - during read requests, cassandra will ask one node for the data and >> the others involved in the CL for a digest; if the digests do not >> match, >> it will ask them for the entire data, handle the merge, and finally ask >> those nodes for a background repair. Your write may have succeeded during this >> time.
>> >> > This is very good info, but as a minor correction, the repair here will > happen in the foreground before the response is returned to the client. > So, at least from a single client's perspective, you get monotonic reads. > > > -- > Tyler Hobbs > DataStax <http://datastax.com/> > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
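The quorum-overlap arithmetic behind both answers can be made concrete. A minimal sketch, reusing the A/B/C node names from the scenario above:

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

# With RF=3, any write quorum and any read quorum intersect in at least one
# node; that overlap is what gives a single client read-your-writes once the
# write has actually SUCCEEDED.
rf = 3
assert quorum(rf) + quorum(rf) > rf

# But while a write is still in flight (acknowledged only by node A), two
# readers can pick different quorums and observe different values -- the
# race condition discussed in this thread.
acked = {"A"}             # C1's write has reached node A only
c2_quorum = {"A", "B"}    # digest mismatch -> foreground read repair -> new value
c3_quorum = {"B", "C"}    # no mismatch -> old value
assert acked & c2_quorum          # C2's quorum sees the in-flight write
assert not (acked & c3_quorum)    # C3's quorum does not
```

The overlap guarantee only constrains reads that start after the write has completed; it says nothing about reads concurrent with the write, which is exactly the gap C2 and C3 fall into.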
Re: race condition for quorum consistency
My understanding of the described scenario is that the write hasn't succeeded when reads are fired, as B and C haven't processed the mutation yet. There would be 3 clients here and not 2 : C1 writes, C2 and C3 read. So the race condition could still happen in this particular case. Le mer. 14 sept. 2016 21:07, Work <jrother...@codojo.me> a écrit : > Hi Alex: > > Hmmm ... Assuming clock skew is eliminated And assuming nodes are up > and available ... And assuming quorum writes and quorum reads and everyone > waiting for success ( which is NOT The OP scenario), Two different clients > will be guaranteed to see all successful writes, or be told that read > failed. > > C1 writes at quorum to A,B > C2 reads at quorum. > So it tries to read from ALL nodes, A,B, C. > If A,B respond --> success > If A,C respond --> conflict > If B, C respond --> conflict > Because a quorum (2 nodes) responded, the coordinator will return the > latest time stamp and may issue read repair depending on YAML settings. > > So where do you see only one client having this guarantee? > > Regards, > > James > > On Sep 14, 2016, at 4:00 AM, Alexander DEJANOVSKI <adejanov...@gmail.com> > wrote: > > Hi, > > the analysis is valid, and strong consistency the Cassandra way means that > one client writing at quorum, then reading at quorum will always see his > previous write. > Two different clients have no guarantee to see the same data when using > quorum, as illustrated in your example. > > Only options here are to route requests to specific clients based on some > id to guarantee the sequence of operations outside of Cassandra (the same > client will always be responsible for a set of ids), or raise the CL to ALL > at the expense of availability (you should not do that). > > > Cheers, > > Alex > > Le mer. 14 sept. 2016 à 11:47, Qi Li <ken.l...@gmail.com> a écrit : > >> hi all, >> >> we are using quorum consistency, and we *suspect* there may be a race >> condition during the write. lets say RF is 3. 
so a write will wait for at >> least 2 nodes to ack. suppose only 1 node has acked (node A). the other >> 2 nodes (node B, C) are still waiting to update. two read requests come in: >> one read gets the data responded from nodes B and C, so version 1 >> is returned. >> the other gets data responded from nodes A and B, so the latest >> version 2 is returned. >> >> so clients are getting different data at the same time. is this a valid >> analysis? if so, are there any options we can set to deal with this issue? >> >> thanks >> Ken >> > -- - Alexander Dejanovski France @alexanderdeja Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: race condition for quorum consistency
Hi, the analysis is valid, and strong consistency the Cassandra way means that one client writing at quorum, then reading at quorum, will always see his previous write. Two different clients have no guarantee of seeing the same data when using quorum, as illustrated in your example. The only options here are to route requests to specific clients based on some id to guarantee the sequence of operations outside of Cassandra (the same client will always be responsible for a set of ids), or to raise the CL to ALL at the expense of availability (you should not do that). Cheers, Alex Le mer. 14 sept. 2016 à 11:47, Qi Li a écrit : > hi all, > > we are using quorum consistency, and we *suspect* there may be a race > condition during the write. lets say RF is 3. so a write will wait for at > least 2 nodes to ack. suppose only 1 node has acked (node A). the other > 2 nodes (node B, C) are still waiting to update. two read requests come in: > one read gets the data responded from nodes B and C, so version 1 > is returned. > the other gets data responded from nodes A and B, so the latest > version 2 is returned. > > so clients are getting different data at the same time. is this a valid > analysis? if so, are there any options we can set to deal with this issue? > > thanks > Ken >
Re: How to start using incremental repairs?
Hi Paulo, don't you think it might be better to keep applying the migration procedure whatever the version? Anticompaction is pretty expensive on big SSTables, and if the cluster has a lot of data the first run might be very, very long if the nodes are dense, especially with a high number of vnodes. We've seen this on clusters that had just upgraded from 2.1 to 3.0, where the first incremental repair was taking a ridiculous amount of time as there were loads of anticompactions running. Indeed, if you run an inc repair on all ranges on a node, it can skip anticompaction by just marking SSTables as being repaired (which is fast), but the rest of the nodes will still have to perform anticompaction as they won't share all of its token ranges. Right? Cheers, Alex Le lun. 12 sept. 2016 à 13:56, Paulo Motta a écrit : > > Can you clarify for me please whether what you said here applies to version > 2.1.14? > > yes > > 2016-09-06 5:50 GMT-03:00 Jean Carlo : > >> Hi Paulo >> >> Can you clarify for me please whether what you said here >> >> 1. Migration procedure is no longer necessary after CASSANDRA-8004, and >> since you never ran repair before this would not make any difference >> anyway, so just run repair and by default (CASSANDRA-7250) this will >> already be incremental. >> >> applies to version 2.1.14. I ask because I see that the jira >> CASSANDRA-8004 is resolved for version 2.1.2 and we are considering >> migrating to inc repairs before going to version 3.0.x >> >> Thx :) >> >> >> Saludos >> >> Jean Carlo >> >> "The best way to predict the future is to invent it" Alan Kay >> >> On Fri, Aug 26, 2016 at 9:04 PM, Stefano Ortolani >> wrote: >>> An extract of this conversation should definitely be posted somewhere. >>> Read a lot but never learnt all these bits...
>>> >>> On Fri, Aug 26, 2016 at 2:53 PM, Paulo Motta >>> wrote: >>> > I must admit that I fail to understand currently how running repair with -pr could leave unrepaired data though, even when ran on all nodes in all DCs, and how that could be specific to incremental repair (and would appreciate if someone shared the explanation). Anti-compaction, which marks tables as repaired, is disabled for partial range repairs (which includes partitioner-range repair) to avoid the extra I/O cost of needing to run anti-compaction multiple times in a node to repair it completely. For example, there is an optimization which skips anti-compaction for sstables fully contained in the repaired range (only the repairedAt field is mutated), which is leveraged by full range repair, which would not work in many cases for partial range repairs, yielding higher I/O. 2016-08-26 10:17 GMT-03:00 Stefano Ortolani : > I see. Didn't think about it that way. Thanks for clarifying! > > > On Fri, Aug 26, 2016 at 2:14 PM, Paulo Motta > wrote: > >> > What is the underlying reason? >> >> Basically to minimize the amount of anti-compaction needed, since >> with RF=3 you'd need to perform anti-compaction 3 times in a particular >> node to get it fully repaired, while without it you can just repair the >> full node's range in one run. Assuming you run repair frequent enough >> this >> will not be a big deal, since you will skip already repaired data in the >> next round so you will not have the problem of re-doing work as in >> non-inc >> non-pr repair. >> >> 2016-08-26 7:57 GMT-03:00 Stefano Ortolani : >> >>> Hi Paulo, could you elaborate on 2? >>> I didn't know incremental repairs were not compatible with -pr >>> What is the underlying reason? >>> >>> Regards, >>> Stefano >>> >>> >>> On Fri, Aug 26, 2016 at 1:25 AM, Paulo Motta < >>> pauloricard...@gmail.com> wrote: >>> 1. 
Migration procedure is no longer necessary after CASSANDRA-8004, and since you never ran repair before this would not make any difference anyway, so just run repair and by default (CASSANDRA-7250) this will already be incremental. 2. Incremental repair is not supported with -pr, -local or -st/-et options, so you should run incremental repair in all nodes in all DCs sequentially (you should be aware that this will probably generate inter-DC traffic), no need to disable autocompaction or stopping nodes. 2016-08-25 18:27 GMT-03:00 Aleksandr Ivanov : > I’m new in Cassandra and trying to figure out how to _start_ using > incremental repairs. I have seen article about “Migrating to > incremental > repairs” but since I didn’t use repairs before at all and I use >
Re: Output of "select token from system.local where key = 'local' "
Hi Siddharth, yes, we are sure token ranges will never overlap (I think the start token in the describering output is excluded and the end token included). You can get per-host information in the DataStax Java driver using: Set<TokenRange> rangesForKeyspace = cluster.getMetadata().getTokenRanges(keyspaceName, host); Bye, Alex Le mar. 30 août 2016 à 10:04, Siddharth Verma a écrit : > Hi, > Can we be sure that token ranges in nodetool describering will be > non-overlapping? > > Thanks > Siddharth Verma >
Re: Output of "select token from system.local where key = 'local' "
Hi Siddarth, I would recommend running "nodetool describering keyspace_name" as its output is much simpler to reason about : Schema Version:9a091b4e-3712-3149-b187-d2b09250a19b TokenRange: TokenRange(start_token:1943978523300203561, end_token:2137919499801737315, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.7, 127.0.0.2, 127.0.0.5, 127.0.0.1], rpc_endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.7, 127.0.0.2, 127.0.0.5, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.6, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.7, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.2, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.5, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.1, datacenter:dc1, rack:r1)]) TokenRange(start_token:6451470843510300950, end_token:7799236518897713874, endpoints:[127.0.0.6, 127.0.0.4, 127.0.0.1, 127.0.0.3, 127.0.0.5, 127.0.0.2], rpc_endpoints:[127.0.0.6, 127.0.0.4, 127.0.0.1, 127.0.0.3, 127.0.0.5, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.4, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.1, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.3, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.5, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.2, datacenter:dc1, rack:r1)]) TokenRange(start_token:-2494488779943368698, end_token:-2344803022847488784, endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.6, 127.0.0.5, 127.0.0.3, 127.0.0.2], rpc_endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.6, 127.0.0.5, 127.0.0.3, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.4, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.6, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.5, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.3, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.2, datacenter:dc1, rack:r1)]) 
TokenRange(start_token:-3354341409828719744, end_token:-3001704612215276412, endpoints:[127.0.0.7, 127.0.0.1, 127.0.0.4, 127.0.0.6, 127.0.0.3, 127.0.0.2], rpc_endpoints:[127.0.0.7, 127.0.0.1, 127.0.0.4, 127.0.0.6, 127.0.0.3, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.7, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.1, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.4, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.6, datacenter:dc2, rack:r1), EndpointDetails(host:127.0.0.3, datacenter:dc1, rack:r1), EndpointDetails(host:127.0.0.2, datacenter:dc1, rack:r1)]) It will give you the start and end token of each range (vnode) and the list of replicas for each (the first endpoint being the primary). Hope this helps you figure out your token distribution. Alex Le mar. 30 août 2016 à 09:11, Siddharth Verma a écrit : > Hi, > I saw that in cassandra-driver-core (3.1.0), Metadata.TokenMap has > primaryToTokens, which has the value for ALL the nodes. > I tried to find (primary) range ownership for nodes in one DC, > and executed the following in debug mode in the IDE.
> > TreeMap<Long, Host> primaryTokenMap = new TreeMap<>(); > for(Host host : > main.cluster.getMetadata().tokenMap.primaryToTokens.keySet()){ > if(!host.getDatacenter().equals("dc2")) > continue; > for(Token token : > main.cluster.getMetadata().tokenMap.primaryToTokens.get(host)){ > primaryTokenMap.put((Long) token.getValue(), host); > > } > } > primaryTokenMap // this printed the map in the "evaluate code fragment" window > > dc2 has 3 nodes, RF is 3 > Sample entries: > 244925668410340093 -> /10.0.3.79:9042 > 291047688656337660 -> /10.0.3.217:9042 > 317775761591844910 -> /10.0.3.135:9042 > 328177243900091789 -> /10.0.3.79:9042 > 329239043633655596 -> /10.0.3.135:9042 > > > Can I safely assume > Token Range -> Host > 244925668410340093 to 291047688656337660 -1 belongs to 10.0.3.79:9042 > 291047688656337660 to 317775761591844910 -1 belongs to 10.0.3.135:9042 > 317775761591844910 to 328177243900091789 -1 belongs to 10.0.3.135:9042 > And so on. > > Is the above assumption ABSOLUTELY correct? > (Kindly suggest changes/errors, if any) > > Any help would be great. > Thanks and Regards, > Siddharth Verma >
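Following the statement earlier in this thread that a describering range excludes its start token and includes its end token, primary ownership over the sample tokens from the question can be sketched as below. Note the assumptions: a single rack per DC, and ownership going to the host of the *end* token of each range, which gives a different answer than the start-inclusive assumption in the question:

```python
from bisect import bisect_left

# Sorted (token, host) pairs taken from the sample entries above (dc2, RF=3).
ring = [
    (244925668410340093, "10.0.3.79"),
    (291047688656337660, "10.0.3.217"),
    (317775761591844910, "10.0.3.135"),
    (328177243900091789, "10.0.3.79"),
    (329239043633655596, "10.0.3.135"),
]
tokens = [t for t, _ in ring]

def primary_replica(token: int) -> str:
    """Primary replica = host owning the first ring token >= the key's token,
    wrapping past the last token (range semantics: (prev_token, own_token])."""
    i = bisect_left(tokens, token)
    return ring[i % len(ring)][1]

# Token 291047688656337660 - 1 falls in (244925668410340093, 291047688656337660],
# so under end-inclusive semantics it belongs to the owner of the END token,
# 10.0.3.217 -- not 10.0.3.79 as the start-inclusive reading would suggest.
assert primary_replica(291047688656337660 - 1) == "10.0.3.217"
# Tokens past the last entry wrap around to the first host in the ring.
assert primary_replica(329239043633655596 + 1) == "10.0.3.79"
```

This is a sketch over per-DC tokens only; comparing its output against nodetool describering, as recommended above, is the safe way to validate the assumption on a real cluster.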
Re: New data center to an existing cassandra cluster
Reads at quorum in dc3 will involve dc1 and dc2, as they will require a response from more than half the replicas throughout the cluster. If you're using RF=3 in each DC, each read will need at least 5 responses, which DC3 cannot provide on its own. You could have trouble if DC3 held more than half the replicas, but I guess/hope that is not the case; otherwise you're fine. You would be in trouble, though, if you were using local_quorum on DC3 or ONE on any DC. Le sam. 27 août 2016 19:11, Surbhi Gupta a écrit : > Yes, it will have issues during the time the new nodes are building, > so it is always advised to use LOCAL_QUORUM instead of QUORUM and > LOCAL_ONE instead of ONE > > On 27 August 2016 at 09:45, laxmikanth sadula > wrote: > >> Hi, >> >> I'm going to add a new data center DC3 to an existing cassandra cluster >> which already has 2 data centers DC1, DC2. >> >> The thing I'm worried about is tables in one keyspace which have QUORUM >> reads and NOT LOCAL_QUORUM. >> So while adding a new data center with auto_bootstrap:false and >> 'nodetool rebuild', will the queries to tables in this keyspace have >> any issue? >> >> Thanks in advance. >> >> -- >> Regards, >> Laxmikanth >> > >
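The "at least 5 responses" figure follows from how QUORUM counts replicas across the whole cluster rather than per DC. A minimal sketch, assuming RF=3 in each DC as in the example above:

```python
def quorum(total_replicas: int) -> int:
    """Smallest majority: floor(replicas / 2) + 1."""
    return total_replicas // 2 + 1

# QUORUM counts replicas across ALL datacenters. With RF=3 in each of three
# DCs there are 9 replicas in total, so a QUORUM read needs 5 responses --
# more than any single DC (3 replicas) can provide by itself.
rf_per_dc = {"DC1": 3, "DC2": 3, "DC3": 3}
total = sum(rf_per_dc.values())
assert quorum(total) == 5
assert all(quorum(total) > rf for rf in rf_per_dc.values())

# LOCAL_QUORUM, by contrast, only needs a majority within the coordinator's DC:
assert quorum(rf_per_dc["DC3"]) == 2
```

This also shows why QUORUM keeps working while DC3 rebuilds: the 5 required responses can still come entirely from DC1 and DC2, whereas a hypothetical LOCAL_QUORUM on the still-empty DC3 could not be satisfied with correct data.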
Re: How to start using incremental repairs?
After running some tests I can confirm that using -pr leaves unrepaired SSTables, while removing it shows repaired SSTables only once repair is completed. The purpose of -pr was to lighten the repair process by not repairing ranges RF times, but just once. With incremental repair though, repaired data is marked as such and will be skipped on the next session, making -pr kinda useless. I must admit that I fail to understand currently how running repair with -pr could leave unrepaired data though, even when run on all nodes in all DCs, and how that could be specific to incremental repair (and would appreciate it if someone shared the explanation). On a side note, I have a Spotify Reaper fork that handles incremental repair, and embeds the UI of Stefan Podkowinski, tweaked to add incremental repair inputs : https://github.com/adejanovski/cassandra-reaper/tree/inc-repair-support-with-ui Compile it with maven and run with : java -jar target/cassandra-reaper-0.2.4-SNAPSHOT.jar server resource/cassandra-reaper.yaml Then go to http://127.0.0.1:8081/webui/ Le ven. 26 août 2016 à 12:58, Stefano Ortolani a écrit : > Hi Paulo, could you elaborate on 2? > I didn't know incremental repairs were not compatible with -pr > What is the underlying reason? > > Regards, > Stefano > > > On Fri, Aug 26, 2016 at 1:25 AM, Paulo Motta > wrote: > >> 1. Migration procedure is no longer necessary after CASSANDRA-8004, and >> since you never ran repair before this would not make any difference >> anyway, so just run repair and by default (CASSANDRA-7250) this will >> already be incremental. >> 2. Incremental repair is not supported with -pr, -local or -st/-et >> options, so you should run incremental repair in all nodes in all DCs >> sequentially (you should be aware that this will probably generate inter-DC >> traffic), no need to disable autocompaction or stopping nodes.
>> >> 2016-08-25 18:27 GMT-03:00 Aleksandr Ivanov : >> >>> I’m new in Cassandra and trying to figure out how to _start_ using >>> incremental repairs. I have seen article about “Migrating to incremental >>> repairs” but since I didn’t use repairs before at all and I use Cassandra >>> version v3.0.8, then maybe not all steps are needed which are mentioned in >>> Datastax article. >>> Should I start with full repair or I can start with executing “nodetool >>> repair -pr my_keyspace” on all nodes without autocompaction disabling and >>> node stopping? >>> >>> I have 6 datacenters with 6 nodes in each DC. Is it enough to run >>> “nodetool repair -pr my_keyspace” in one DC only or it should be executed >>> on all nodes in _all_ DCs? >>> >>> I have tried to perform “nodetool repair -pr my_keyspace” on all nodes >>> in all datacenters sequentially but I still can see non repaired SSTables >>> for my_keyspace (Repaired at: 0). Is it expected behavior if during >>> repair data in my_keyspace wasn’t modified (no writes, no reads)? >>> >> >> >
Re: How to start using incremental repairs?
There are 2 main reasons I see for still having unrepaired sstables after running nodetool repair -pr: 1- new data is still flowing into your database after the repair sessions were launched, and thus hasn't been repaired 2- some repair sessions failed and left unrepaired data on your nodes. Incremental repair isn't fond of concurrency, as an SSTable cannot be anticompacted and go through validation compaction at the same time. So if an SSTable is being anticompacted and another node asks for a Merkle tree that involves it, it will fail with a message in the system.log saying that an sstable cannot be involved in more than one repair session at a time (search for validation failures in your cassandra log). Your best chance of having it succeed IMHO is to run inc repair one node at a time. Le ven. 26 août 2016 à 08:02, Aleksandr Ivanov a écrit : > Thanks for the confirmation Paulo. Then my understanding of the process was > correct. > > I'm curious why I still see unrepaired sstables after performing repair > -pr on all nodes in all datacenters... > > пт, 26 Авг 2016 г., 3:25 Paulo Motta : > >> 1. Migration procedure is no longer necessary after CASSANDRA-8004, and >> since you never ran repair before this would not make any difference >> anyway, so just run repair and by default (CASSANDRA-7250) this will >> already be incremental. >> 2. Incremental repair is not supported with -pr, -local or -st/-et >> options, so you should run incremental repair in all nodes in all DCs >> sequentially (you should be aware that this will probably generate inter-DC >> traffic), no need to disable autocompaction or stopping nodes. >> >> 2016-08-25 18:27 GMT-03:00 Aleksandr Ivanov : >> >>> I'm new in Cassandra and trying to figure out how to _start_ using >>> incremental repairs.
I have seen article about “Migrating to incremental >>> repairs” but since I didn’t use repairs before at all and I use Cassandra >>> version v3.0.8, then maybe not all steps are needed which are mentioned in >>> Datastax article. >>> Should I start with full repair or I can start with executing “nodetool >>> repair -pr my_keyspace” on all nodes without autocompaction disabling and >>> node stopping? >>> >>> I have 6 datacenters with 6 nodes in each DC. Is it enough to run >>> “nodetool repair -pr my_keyspace” in one DC only or it should be executed >>> on all nodes in _all_ DCs? >>> >>> I have tried to perform “nodetool repair -pr my_keyspace” on all nodes >>> in all datacenters sequentially but I still can see non repaired SSTables >>> for my_keyspace (Repaired at: 0). Is it expected behavior if during >>> repair data in my_keyspace wasn’t modified (no writes, no reads)? >>> >> >>