Thanks Alexander. Will look into all these.
On Thu, Sep 29, 2016 at 4:39 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

> Atul,
>
> Since you're using 3.6, by default you're running incremental repair,
> which doesn't like concurrency very much.
> Validation errors don't occur on a partition or partition-range basis,
> but when you try to run both anticompaction and validation compaction
> on the same SSTable.
>
> As advised to Robert yesterday, if you want to keep running incremental
> repair, I'd suggest the following:
>
> - Run "nodetool tpstats" on all nodes in search of running/pending
>   repair sessions.
> - If you have some, and to be sure you avoid conflicts, rolling-restart
>   your cluster (all nodes).
> - Then run "nodetool repair" on one node.
> - When repair has finished on this node (track messages in the log and
>   nodetool tpstats), check whether other nodes are running anticompactions.
> - If so, wait until they are over.
> - If not, move on to the next node.
>
> You should be able to run concurrent incremental repairs on different
> tables if you wish to speed up the complete repair of the cluster, but do
> not try to repair the same table/full keyspace from two nodes at the same
> time.
>
> If you do not want to keep using incremental repair and want to fall back
> to classic full repair, I think the only way in 3.6 to avoid anticompaction
> is to use subrange repair (Paulo mentioned that in 3.x full repair also
> triggers anticompaction).
>
> You have two options here: cassandra_range_repair
> (https://github.com/BrianGallew/cassandra_range_repair) and Spotify Reaper
> (https://github.com/spotify/cassandra-reaper).
>
> cassandra_range_repair might complain about subrange + incremental not
> being compatible (not sure here), but you can modify the repair_range()
> method by adding a --full switch to the command line used to run repair.
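[Editor's note: the subrange + full repair approach above can be sketched roughly as follows. This is a hypothetical illustration, not the actual cassandra_range_repair code; the keyspace name and step count are made up, and only the `-st`/`-et`/`--full` nodetool flags come from the thread.]

```python
# Hypothetical sketch: split the full Murmur3 token ring into contiguous
# subranges and run a full (non-incremental) repair on each one, which
# avoids anticompaction on Cassandra 3.x.
import subprocess

MIN_TOKEN = -(2 ** 63)       # Murmur3Partitioner minimum token
MAX_TOKEN = 2 ** 63 - 1      # Murmur3Partitioner maximum token


def split_token_ring(steps):
    """Split [MIN_TOKEN, MAX_TOKEN] into `steps` contiguous subranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // steps
    ranges = []
    start = MIN_TOKEN
    for i in range(steps):
        # Last subrange absorbs any rounding remainder.
        end = MAX_TOKEN if i == steps - 1 else start + width
        ranges.append((start, end))
        start = end
    return ranges


def repair_range(keyspace, start, end, dry_run=True):
    """Run a full repair on one token subrange via nodetool."""
    cmd = ["nodetool", "repair", "-st", str(start), "-et", str(end),
           "--full", keyspace]
    if dry_run:
        print(" ".join(cmd))       # show the command instead of running it
    else:
        subprocess.check_call(cmd)


if __name__ == "__main__":
    for st, et in split_token_ring(4):          # step count is arbitrary here
        repair_range("my_keyspace", st, et)     # hypothetical keyspace name
```

In practice a tool like cassandra_range_repair derives the subranges from each node's actual token ownership rather than splitting the whole ring evenly; the even split above just keeps the sketch self-contained.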
>
> We have a fork of Reaper that handles both full subrange repair and
> incremental repair here: https://github.com/thelastpickle/cassandra-reaper
> It comes with a tweaked version of the UI made by Stephan Podkowinski
> (https://github.com/spodkowinski/cassandra-reaper-ui) - which eases
> interactions to schedule, run, and track repairs - and adds fields to run
> incremental repair (accessible via ...:8080/webui/ in your browser).
>
> Cheers,
>
>
> On Thu, Sep 29, 2016 at 12:33 PM Atul Saroha <atul.sar...@snapdeal.com>
> wrote:
>
>> Hi,
>>
>> We are not sure whether this issue is linked to that node or not. Our
>> application does frequent deletes and inserts.
>>
>> Maybe our approach to nodetool repair is not correct. Yes, we generally
>> fire repair on all boxes at the same time. Until now it was manual, with
>> the default configuration (command: "nodetool repair").
>> Yes, we saw a validation error, but that was linked to an already running
>> repair of the same partition range on another box. The error said
>> validation failed, with some IP, as repair was already running for the
>> same SSTable.
>> Just a few days back, we had 2 DCs with 3 nodes each, and the replication
>> factor was also 3. That means all data was on each node.
>>
>> On Thu, Sep 29, 2016 at 2:49 PM, Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Hi Atul,
>>>
>>> Could you be more specific on how you are running repair? What's the
>>> precise command line for that, does it run on several nodes at the same
>>> time, etc.?
>>> What is your gc_grace_seconds?
>>> Do you see errors in your logs that would be linked to repairs
>>> (validation failure or failure to create a Merkle tree)?
>>>
>>> You mention a single node that went down, but say the whole cluster
>>> seems to have zombie data.
>>> What is the connection you see between the node that went down and the
>>> fact that deleted data comes back to life?
>>> What is your strategy for cyclic maintenance repair (schedule, command
>>> line or tool, etc.)?
>>>
>>> Thanks,
>>>
>>> On Thu, Sep 29, 2016 at 10:40 AM Atul Saroha <atul.sar...@snapdeal.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have seen weird behaviour in Cassandra 3.6.
>>>> One of our nodes was down for more than 10 hours. After that, we ran
>>>> nodetool repair multiple times, but tombstones are not getting synced
>>>> properly across the cluster. On a day-to-day basis, at the expiry of
>>>> every grace period, deleted records start surfacing again in Cassandra.
>>>>
>>>> It seems nodetool repair is not syncing tombstones across the cluster.
>>>> FYI, we now have 3 data centres.
>>>>
>>>> We just want help on how to verify and debug this issue. Help will be
>>>> appreciated.
>>>>
>>>> --
>>>> Regards,
>>>> Atul Saroha
>>>>
>>>> Lead Software Engineer | CAMS
>>>>
>>>> M: +91 8447784271
>>>> Plot #362, ASF Center - Tower A, 1st Floor, Sec-18,
>>>> Udyog Vihar Phase IV, Gurgaon, Haryana, India
>>>>
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>
>

--
Regards,
Atul Saroha

Lead Software Engineer | CAMS

M: +91 8447784271
Plot #362, ASF Center - Tower A, 1st Floor, Sec-18,
Udyog Vihar Phase IV, Gurgaon, Haryana, India