On Thu, Sep 15, 2011 at 10:03 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> If you added the new node as a seed, it would ignore bootstrap mode.
> And bootstrap / repair *do* use streaming so you'll want to re-run
> repair post-scrub.  (No need to re-bootstrap since you're repairing.)
>

Ah, of course.  That's what happened: the chef recipe added the node to its
own seed list, a problem I thought we'd fixed but apparently hadn't.  That
definitely explains the bootstrap issue.  But no matter, so long as the
repairs can eventually run.
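
For future reference, here's a minimal sketch of what we think the
replacement node's cassandra.yaml should look like so that it actually
bootstraps (addresses are illustrative, and the exact seed_provider layout
may vary a bit across 0.8.x):

    # cassandra.yaml on the replacement node (illustrative addresses)
    auto_bootstrap: true
    initial_token: <dead node's token minus 1>
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              # List only established nodes here, never this node's own
              # address; a node that considers itself a seed skips bootstrap.
              - seeds: "10.34.90.2,10.34.90.3"

The chef recipe just needs to filter the node's own IP out of that seed list.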


> Scrub is a little less heavyweight than major compaction but same
> ballpark.  It runs sstable-at-a-time so (as long as you haven't been
> in the habit of forcing majors) space should not be a concern.
>

Cool.  We've deactivated all tasks against these nodes and will scrub them
all in parallel, apply the encryption options you specified, and see where
that gets us.  Thanks for the assistance.
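
In case it's useful to anyone following the thread, here's roughly the
sequence we're planning, sketched with illustrative hostnames (scrubs run in
parallel per Jonathan's note, post-scrub repairs staggered node by node):

    # Scrub every node in parallel; scrub works one sstable at a time per node
    for h in cass1 cass2 cass3 cass4 cass5 cass6 cass7; do
        nodetool -h $h scrub &
    done
    wait

    # While the scrubs run, keep an eye on progress and disk headroom
    nodetool -h cass1 compactionstats
    df -h /mnt/cassandra/data

    # Once every node has finished, re-run repair node by node and watch the
    # streams to make sure they actually complete this time
    nodetool -h cass1 repair events_production
    nodetool -h cass1 netstats
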
- Ethan


> On Thu, Sep 15, 2011 at 8:40 AM, Ethan Rowe <et...@the-rowes.com> wrote:
> > On Thu, Sep 15, 2011 at 9:21 AM, Jonathan Ellis <jbel...@gmail.com>
> wrote:
> >>
> >> Where did the data loss come in?
> >
> > The outcome of the analytical jobs run overnight while some of these
> repairs
> > were (not) running is consistent with what I would expect if perhaps
> 20-30%
> > of the source data was missing.  Given the strong consistency model we're
> > using, this is surprising to me, since the jobs did not report any read
> or
> > write failures.  I wonder if this is a consequence of the dead node
> missing
> > and the new node being operational but having received basically none of
> its
> > hinted handoff streams.  Perhaps with streaming fixed the data will
> > reappear, which would be a happy outcome, but if not, I can reimport the
> > critical stuff from files.
> >>
> >> Scrub is safe to run in parallel.
> >
> > Is it somewhat analogous to a major compaction in terms of I/O impact,
> with
> > perhaps less greedy use of disk space?
> >
> >>
> >> On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe <et...@the-rowes.com>
> wrote:
> >> > After further review, I'm definitely going to scrub all the original
> >> > nodes
> >> > in the cluster.
> >> > We've lost some data as a result of this situation.  It can be
> restored,
> >> > but
> >> > the question is what to do with the problematic new node first.  I
> don't
> >> > particularly care about the data that's on it, since I'm going to
> >> > re-import
> >> > the critical data from files anyway, and then I can recreate
> derivative
> >> > data
> >> > afterwards.  So it's purely a matter of getting the cluster healthy
> >> > again as
> >> > quickly as possible so I can begin that import process.
> >> > Any issue with running scrubs on multiple nodes at a time, provided
> they
> >> > aren't replication neighbors?
> >> > On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe <et...@the-rowes.com>
> wrote:
> >> >>
> >> >> I just noticed the following from one of Jonathan Ellis' messages
> >> >> yesterday:
> >> >>>
> >> >>> Added to NEWS:
> >> >>>
> >> >>>    - After upgrading, run nodetool scrub against each node before
> >> >>> running
> >> >>>      repair, moving nodes, or adding new ones.
> >> >>
> >> >>
> >> >> We did not do this, as it was not indicated as necessary in the NEWS file
> >> >> when
> >> >> we were dealing with the upgrade.
> >> >> So perhaps I need to scrub everything before going any further,
> though
> >> >> the
> >> >> question is what to do with the problematic node.  Additionally, it
> >> >> would be
> >> >> helpful to know if scrub will affect the hinted handoffs that have
> >> >> accumulated, as these seem likely to be part of the set of failing
> >> >> streams.
> >> >> On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe <et...@the-rowes.com>
> >> >> wrote:
> >> >>>
> >> >>> Here's a typical log slice (not terribly informative, I fear):
> >> >>>>
> >> >>>>  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
> >> >>>>  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
> >> >>>>  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
> >> >>>> ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
> >> >>>> java.lang.NullPointerException
> >> >>>>         at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
> >> >>>>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)
> >> >>>
> >> >>> Not sure if the exception is related to the outbound streaming
> above;
> >> >>> other nodes are actively trying to stream to this node, so perhaps
> it
> >> >>> comes
> >> >>> from those and temporal adjacency to the outbound stream is just
> >> >>> coincidental.  I have other snippets that look basically identical
> to
> >> >>> the
> >> >>> above, except if I look at the logs to which this node is trying to
> >> >>> stream,
> >> >>> I see that it has concurrently opened a stream in the other
> direction,
> >> >>> which
> >> >>> could be the one that the exception pertains to.
> >> >>>
> >> >>> On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne
> >> >>> <sylv...@datastax.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <et...@the-rowes.com>
> >> >>>> wrote:
> >> >>>> > Hi.
> >> >>>> >
> >> >>>> > We've been running a 7-node cluster with RF 3, QUORUM
> reads/writes
> >> >>>> > in
> >> >>>> > our
> >> >>>> > production environment for a few months.  It's been consistently
> >> >>>> > stable
> >> >>>> > during this period, particularly once we got our maintenance
> >> >>>> > strategy
> >> >>>> > fully
> >> >>>> > worked out (per node, one repair a week, one major compaction a
> >> >>>> > week,
> >> >>>> > the
> >> >>>> > latter due to the nature of our data model and usage).  While
> this
> >> >>>> > cluster
> >> >>>> > started, back in June or so, on the 0.7 series, it's been running
> >> >>>> > 0.8.3 for
> >> >>>> > a while now with no issues.  We upgraded to 0.8.5 two days ago,
> >> >>>> > having
> >> >>>> > tested the upgrade in our staging cluster (with an otherwise
> >> >>>> > identical
> >> >>>> > configuration) previously and verified that our application's
> >> >>>> > various
> >> >>>> > use
> >> >>>> > cases appeared successful.
> >> >>>> >
> >> >>>> > One of our nodes suffered a disk failure yesterday.  We attempted
> >> >>>> > to
> >> >>>> > replace
> >> >>>> > the dead node by placing a new node at OldNode.initial_token - 1
> >> >>>> > with
> >> >>>> > auto_bootstrap on.  A few things went awry from there:
> >> >>>> >
> >> >>>> > 1. We never saw the new node in bootstrap mode; it became
> available
> >> >>>> > pretty
> >> >>>> > much immediately upon joining the ring, and never reported a
> >> >>>> > "joining"
> >> >>>> > state.  I did verify that auto_bootstrap was on.
> >> >>>> >
> >> >>>> > 2. I mistakenly ran repair on the new node rather than
> removetoken
> >> >>>> > on
> >> >>>> > the
> >> >>>> > old node, due to a delightful mental error.  The repair got
> nowhere
> >> >>>> > fast, as
> >> >>>> > it attempts to repair against the down node, which throws an
> >> >>>> > exception.
> >> >>>> >  So I
> >> >>>> > interrupted the repair, restarted the node to clear any pending
> >> >>>> > validation
> >> >>>> > compactions, and...
> >> >>>> >
> >> >>>> > 3. Ran removetoken for the old node.
> >> >>>> >
> >> >>>> > 4. We let this run for some time and saw eventually that all the
> >> >>>> > nodes
> >> >>>> > appeared to be done with various compactions and were stuck at
> >> >>>> > streaming.
> >> >>>> > Many
> >> >>>> > streams listed as open, none making any progress.
> >> >>>> >
> >> >>>> > 5. I observed an RPC-related exception on the new node (where
> the
> >> >>>> > removetoken was launched) and concluded that the streams were
> >> >>>> > broken
> >> >>>> > so the
> >> >>>> > process wouldn't ever finish.
> >> >>>> >
> >> >>>> > 6. Ran a "removetoken force" to get the dead node out of the mix.
> >> >>>> > No
> >> >>>> > problems.
> >> >>>> >
> >> >>>> > 7. Ran a repair on the new node.
> >> >>>> >
> >> >>>> > 8. Validations ran, streams opened up, and again things got stuck
> >> >>>> > in
> >> >>>> > streaming, hanging for over an hour with no progress.
> >> >>>> >
> >> >>>> > 9. Musing that lingering tasks from the removetoken could be a
> >> >>>> > factor,
> >> >>>> > I
> >> >>>> > performed a rolling restart and attempted a repair again.
> >> >>>> >
> >> >>>> > 10. Same problem.  Did another rolling restart and attempted a
> >> >>>> > fresh
> >> >>>> > repair
> >> >>>> > on the most important column family alone.
> >> >>>> >
> >> >>>> > 11. Same problem.  Streams included CFs not specified, so I guess
> >> >>>> > they
> >> >>>> > must
> >> >>>> > be for hinted handoff.
> >> >>>> >
> >> >>>> > In concluding that streaming is stuck, I've observed:
> >> >>>> > - streams will be open to the new node from other nodes, but the
> >> >>>> > new
> >> >>>> > node
> >> >>>> > doesn't list them
> >> >>>> > - streams will be open to the other nodes from the new node, but
> >> >>>> > the
> >> >>>> > other
> >> >>>> > nodes don't list them
> >> >>>> > - the streams reported may make some initial progress, but then
> >> >>>> > they
> >> >>>> > hang at
> >> >>>> > a particular point and do not move on for an hour or more.
> >> >>>> > - The logs report repair-related activity, until NPEs on incoming
> >> >>>> > TCP
> >> >>>> > connections show up, which appear likely to be the culprit.
> >> >>>>
> >> >>>> Can you send the stack trace from those NPEs?
> >> >>>>
> >> >>>> >
> >> >>>> > I can provide more exact details when I'm done commuting.
> >> >>>> >
> >> >>>> > With streaming broken on this node, I'm unable to run repairs,
> >> >>>> > which
> >> >>>> > is
> >> >>>> > obviously problematic.  The application didn't suffer any
> >> >>>> > operational
> >> >>>> > issues
> >> >>>> > as a consequence of this, but I need to review the overnight
> >> >>>> > results
> >> >>>> > to
> >> >>>> > verify we're not suffering data loss (I doubt we are).
> >> >>>> >
> >> >>>> > At this point, I'm considering a couple options:
> >> >>>> > 1. Remove the new node and let the adjacent node take over its
> >> >>>> > range
> >> >>>> > 2. Bring the new node down, add a new one in front of it, and
> >> >>>> > properly
> >> >>>> > removetoken the problematic one.
> >> >>>> > 3. Bring the new node down, remove all its data except for the
> >> >>>> > system
> >> >>>> > keyspace, then bring it back up and repair it.
> >> >>>> > 4. Revert to 0.8.3 and see if that helps.
> >> >>>> >
> >> >>>> > Recommendations?
> >> >>>> >
> >> >>>> > Thanks.
> >> >>>> > - Ethan
> >> >>>> >
> >> >>>
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of DataStax, the source for professional Cassandra support
> >> http://www.datastax.com
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
