Where did the data loss come in? Scrub is safe to run in parallel.
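A minimal sketch of kicking scrub off on a few nodes at once from a single shell; the host names are placeholders (the keyspace is the one visible in the quoted log paths below). Scrub is a local sstable rewrite on each node and doesn't coordinate with the rest of the ring:

    # host names are placeholders; optionally append column family names to narrow the scrub
    for host in node1 node2 node3; do
        nodetool -h "$host" scrub events_production &
    done
    wait    # returns once every backgrounded scrub has exited

The main cost is local disk I/O on each node, so the practical limit is how much simultaneous load the cluster can absorb while still serving reads and writes.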
On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe <et...@the-rowes.com> wrote:
> After further review, I'm definitely going to scrub all the original nodes in the cluster.
>
> We've lost some data as a result of this situation. It can be restored, but the question is what to do with the problematic new node first. I don't particularly care about the data that's on it, since I'm going to re-import the critical data from files anyway, and then I can recreate derivative data afterwards. So it's purely a matter of getting the cluster healthy again as quickly as possible so I can begin that import process.
>
> Any issue with running scrubs on multiple nodes at a time, provided they aren't replication neighbors?
>
> On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe <et...@the-rowes.com> wrote:
>>
>> I just noticed the following from one of Jonathan Ellis' messages yesterday:
>>>
>>> Added to NEWS:
>>>
>>>  - After upgrading, run nodetool scrub against each node before running
>>>    repair, moving nodes, or adding new ones.
>>
>> We did not do this, as it was not indicated as necessary in the news when we were dealing with the upgrade. So perhaps I need to scrub everything before going any further, though the question is what to do with the problematic node. Additionally, it would be helpful to know whether scrub will affect the hinted handoffs that have accumulated, as these seem likely to be part of the set of failing streams.
>>
>> On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe <et...@the-rowes.com> wrote:
>>>
>>> Here's a typical log slice (not terribly informative, I fear):
>>>>
>>>> INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
>>>> INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
>>>> INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
>>>> ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
>>>> java.lang.NullPointerException
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)
>>>
>>> I'm not sure the exception is related to the outbound streaming above; other nodes are actively trying to stream to this node, so perhaps it comes from those, and the temporal adjacency to the outbound stream is just coincidental. I have other snippets that look basically identical to the above, except that if I look at the logs of the node this node is trying to stream to, I see it has concurrently opened a stream in the other direction, which could be the one the exception pertains to.
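One way to line up the two directions is to compare the stream sessions each endpoint reports; nodetool netstats lists a node's incoming and outgoing streams and their progress. The first host below is a placeholder for the new node; the second is the peer from the quoted log lines:

    # "new-node" is a placeholder; 10.34.90.8 is the peer from the quoted log
    nodetool -h new-node netstats
    nodetool -h 10.34.90.8 netstats

A session that appears on only one end, or whose progress never advances across repeated runs, is a good candidate for the stuck stream.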
>>>
>>> On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>
>>>> On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <et...@the-rowes.com> wrote:
>>>> > Hi.
>>>> >
>>>> > We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our production environment for a few months. It's been consistently stable during this period, particularly once we got our maintenance strategy fully worked out (per node, one repair a week and one major compaction a week, the latter due to the nature of our data model and usage). While this cluster started, back in June or so, on the 0.7 series, it's been running 0.8.3 for a while now with no issues. We upgraded to 0.8.5 two days ago, having previously tested the upgrade in our staging cluster (with an otherwise identical configuration) and verified that our application's various use cases appeared successful.
>>>> >
>>>> > One of our nodes suffered a disk failure yesterday. We attempted to replace the dead node by placing a new node at OldNode.initial_token - 1 with auto_bootstrap on. A few things went awry from there:
>>>> >
>>>> > 1. We never saw the new node in bootstrap mode; it became available pretty much immediately upon joining the ring, and never reported a "joining" state. I did verify that auto_bootstrap was on.
>>>> >
>>>> > 2. I mistakenly ran repair on the new node rather than removetoken on the old node, due to a delightful mental error. The repair got nowhere fast, as it attempts to repair against the down node, which throws an exception. So I interrupted the repair, restarted the node to clear any pending validation compactions, and...
>>>> >
>>>> > 3. Ran removetoken for the old node.
>>>> >
>>>> > 4. We let this run for some time and eventually saw that all the nodes appeared to be done with their various compactions and were stuck at streaming: many streams listed as open, none making any progress.
>>>> >
>>>> > 5. I observed an RPC-related exception on the new node (where the removetoken was launched) and concluded that the streams were broken, so the process would never finish.
>>>> >
>>>> > 6. Ran a "removetoken force" to get the dead node out of the mix. No problems.
>>>> >
>>>> > 7. Ran a repair on the new node.
>>>> >
>>>> > 8. Validations ran, streams opened up, and again things got stuck in streaming, hanging for over an hour with no progress.
>>>> >
>>>> > 9. Musing that lingering tasks from the removetoken could be a factor, I performed a rolling restart and attempted a repair again.
>>>> >
>>>> > 10. Same problem. Did another rolling restart and attempted a fresh repair on the most important column family alone.
>>>> >
>>>> > 11. Same problem. The streams included CFs not specified, so I guess they must be for hinted handoff.
>>>> >
>>>> > In concluding that streaming is stuck, I've observed:
>>>> > - streams will be open to the new node from other nodes, but the new node doesn't list them
>>>> > - streams will be open to the other nodes from the new node, but the other nodes don't list them
>>>> > - the streams reported may make some initial progress, but then they hang at a particular point and do not move on for an hour or more
>>>> > - the logs report repair-related activity, until NPEs on incoming TCP connections show up, which appear likely to be the culprit
>>>>
>>>> Can you send the stack trace from those NPEs?
>>>>
>>>> > I can provide more exact details when I'm done commuting.
>>>> >
>>>> > With streaming broken on this node, I'm unable to run repairs, which is obviously problematic. The application didn't suffer any operational issues as a consequence of this, but I need to review the overnight results to verify we're not suffering data loss (I doubt we are).
>>>> >
>>>> > At this point, I'm considering a couple of options:
>>>> > 1. Remove the new node and let the adjacent node take over its range.
>>>> > 2. Bring the new node down, add a new one in front of it, and properly removetoken the problematic one.
>>>> > 3. Bring the new node down, remove all its data except for the system keyspace, then bring it back up and repair it.
>>>> > 4. Revert to 0.8.3 and see if that helps.
>>>> >
>>>> > Recommendations?
>>>> >
>>>> > Thanks.
>>>> > - Ethan

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
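For reference, the replacement procedure described in the quoted thread -- a new node configured one token below the dead node's token with auto_bootstrap enabled, then a removetoken of the dead node's token -- comes down to roughly the following sketch. The token values and host name are placeholders, not values from this cluster:

    # cassandra.yaml on the replacement node (0.8.x); the token is a placeholder
    auto_bootstrap: true
    initial_token: 99999999999999999999999999999999999999   # dead node's token minus 1

    # once the replacement has finished joining, remove the dead node's token via any live node
    nodetool -h live-node removetoken 100000000000000000000000000000000000000
    # only if the removal wedges on dead streams, as described above:
    nodetool -h live-node removetoken force

A node that is genuinely bootstrapping shows up as joining in nodetool ring before it serves requests, which is why the behavior in step 1 of the quoted report stood out.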