Hi Sergio,

No, not at this time.
It was in use with this cluster previously, and while there were no
reaper-specific issues, it was removed to help simplify investigation of
the underlying repair issues I've described.

Thanks.

On Thu, Oct 24, 2019 at 4:21 PM Sergio <lapostadiser...@gmail.com> wrote:

> Are you using Cassandra reaper?
>
> On Thu, Oct 24, 2019, 12:31 PM Ben Mills <b...@bitbrew.com> wrote:
>
>> Greetings,
>>
>> I inherited a small Cassandra cluster with some repair issues and need
>> some advice on recommended next steps. Apologies in advance for a long
>> email.
>>
>> Issue:
>>
>> Intermittent repair failures on two non-system keyspaces:
>>
>> - platform_users
>> - platform_management
>>
>> Repair Type:
>>
>> Full, parallel repairs are run on each of the three nodes every five
>> days.
>>
>> Repair command output for a typical failure:
>>
>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>> keyspace platform_users with repair options (parallelism: parallel,
>> primary range: false, incremental: false, job threads: 1,
>> ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 12)
>> [2019-10-18 00:22:09,242] Repair session
>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>> [(-1890954128429545684,2847510199483651721],
>> (8249813014782655320,-8746483007209345011],
>> (4299912178579297893,6811748355903297393],
>> (-8746483007209345011,-8628999431140554276],
>> (-5865769407232506956,-4746990901966533744],
>> (-4470950459111056725,-1890954128429545684],
>> (4001531392883953257,4299912178579297893],
>> (6811748355903297393,6878104809564599690],
>> (6878104809564599690,8249813014782655320],
>> (-4746990901966533744,-4470950459111056725],
>> (-8628999431140554276,-5865769407232506956],
>> (2847510199483651721,4001531392883953257]] failed with error [repair
>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>> [(-1890954128429545684,2847510199483651721],
>> (8249813014782655320,-8746483007209345011],
>> (4299912178579297893,6811748355903297393],
>> (-8746483007209345011,-8628999431140554276],
>> (-5865769407232506956,-4746990901966533744],
>> (-4470950459111056725,-1890954128429545684],
>> (4001531392883953257,4299912178579297893],
>> (6811748355903297393,6878104809564599690],
>> (6878104809564599690,8249813014782655320],
>> (-4746990901966533744,-4470950459111056725],
>> (-8628999431140554276,-5865769407232506956],
>> (2847510199483651721,4001531392883953257]]] Validation failed in
>> /10.x.x.x (progress: 26%)
>> [2019-10-18 00:22:09,246] Some repair failed
>> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>>
>> Additional Notes:
>>
>> Repairs encounter the above failures more often than not. Sometimes
>> they occur on one node only, though occasionally on two. Sometimes just
>> one of the two keyspaces is affected, sometimes both. Apparently the
>> previous repair schedule for this cluster included incremental repairs
>> (a script alternated between incremental and full repairs). After
>> reading this TLP article:
>>
>> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>>
>> the repair script was replaced with cassandra-reaper (v1.4.0), which
>> was run with its default configs. Reaper ran fine, but it only obscured
>> the ongoing issues (it did not resolve them) and complicated the
>> debugging process, so it was removed. The current repair schedule is as
>> described above under Repair Type.
>>
>> Attempts at Resolution:
>>
>> (1) nodetool scrub was attempted on the offending keyspaces/tables to
>> no effect.
>>
>> (2) sstablescrub has not been attempted due to the current design of
>> the Docker image that runs Cassandra in each Kubernetes pod - i.e.
>> there is no way to stop the server to run this utility without killing
>> the only pid running in the container.
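>> To clarify (2), here is a sketch of the kind of workaround being
>> considered but not yet tried. The resource names ("cassandra",
>> "cassandra-0", first container in the pod spec) are placeholders, not
>> from the actual manifests: override the container command so the pod
>> starts without Cassandra, then exec the offline scrub. The script
>> defaults to a dry run (it prints the steps) since it needs a live
>> cluster to do anything real:

```shell
#!/usr/bin/env sh
# Hypothetical workaround sketch for running sstablescrub when Cassandra
# is PID 1 in the container. Names below ("cassandra", "cassandra-0") are
# assumptions. DRY_RUN=1 (the default) prints each step instead of
# executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# 1. Override the container command so the pod comes up without Cassandra:
run kubectl patch statefulset cassandra --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["sleep","infinity"]}]'
# 2. Recreate one pod so the override takes effect:
run kubectl delete pod cassandra-0
# 3. With the server down, run the offline scrub against an affected table:
run kubectl exec cassandra-0 -- sstablescrub platform_users access_tokens_v2
```

>> (Reverting the patch afterwards, and repeating per pod, is left out of
>> the sketch.)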
>>
>> Related Error:
>>
>> Not sure if this is related, but sometimes, when either:
>>
>> (a) running nodetool snapshot, or
>> (b) rolling a pod that runs a Cassandra node, which calls nodetool
>> drain prior to shutdown,
>>
>> the following error is thrown:
>>
>> -- StackTrace --
>> java.lang.RuntimeException: Last written key
>> DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
>> 10df3ba16eb24c8ebdddc0c7af586bda) >= current key
>> DecoratedKey(00000000-0000-0000-0000-000000000000,
>> 17343121887f480c9ba87c0e32206b74) writing into
>> /cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
>> at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
>> at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
>> at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
>> at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
>> at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:477)
>> at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:363)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>>
>> Here are some details on the environment and configs in the event that
>> something is relevant.
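>> Note that the path in that trace is the SSTable for the secondary
>> index directory .device_by_tenant_tags_idx, not the base table. One
>> remediation being considered (not yet tried, and the index name below
>> is inferred from that directory name, so it should be confirmed via
>> DESCRIBE TABLE first) is rebuilding just that index. Dry run by
>> default, since it needs a live node:

```shell
#!/usr/bin/env sh
# Sketch: rebuild only the secondary index named in the stack trace.
# The index name "device_by_tenant_tags_idx" is inferred from the
# ".device_by_tenant_tags_idx" data directory and is an assumption.
# DRY_RUN=1 (the default) prints the command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

run nodetool rebuild_index platform_management device_by_tenant_v2 device_by_tenant_tags_idx
```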
>>
>> Environment: Kubernetes
>> Environment Config: Stateful set of 3 replicas
>> Storage: Persistent Volumes
>> Storage Class: SSD
>> Node OS: Container-Optimized OS
>> Container OS: Ubuntu 16.04.3 LTS
>>
>> Version: Cassandra 3.7
>> Data Centers: 1
>> Racks: 3 (one per zone)
>> Nodes: 3
>> Tokens: 4
>> Replication Factor: 3
>> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
>> Compaction Strategy: STCS (all tables)
>> Read/Write Requirements: Blend of both
>> Data Load: <1GB per node
>> gc_grace_seconds: default (10 days - all tables)
>>
>> Memory: 4Gi per node
>> CPU: 3.5 per node (3500m)
>>
>> Java Version: 1.8.0_144
>>
>> Heap Settings:
>>
>> -XX:+UnlockExperimentalVMOptions
>> -XX:+UseCGroupMemoryLimitForHeap
>> -XX:MaxRAMFraction=2
>>
>> GC Settings: (CMS)
>>
>> -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC
>> -XX:+CMSParallelRemarkEnabled
>> -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1
>> -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:CMSWaitDuration=30000
>> -XX:+CMSParallelInitialMarkEnabled
>> -XX:+CMSEdenChunksRecordAlways
>>
>> Any ideas are much appreciated.
>
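>> P.S. For reference, a back-of-envelope check of what heap those flags
>> should yield (a sketch; it assumes UseCGroupMemoryLimitForHeap picks up
>> the 4Gi container limit as the JVM's notion of RAM):

```shell
# Back-of-envelope: with a 4Gi cgroup memory limit and -XX:MaxRAMFraction=2,
# the JVM sizes the max heap at roughly limit / MaxRAMFraction.
limit_mib=4096            # 4Gi container memory limit (assumed visible to the JVM)
max_ram_fraction=2        # -XX:MaxRAMFraction=2
heap_mib=$((limit_mib / max_ram_fraction))
echo "expected max heap: ${heap_mib} MiB"   # -> 2048 MiB
```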