Sure, it looks like that's in 0.6.4, so I'll probably just rebuild my server
from the 0.6 branch, unless you want me to test just the patch for
1221?  I most likely won't get a chance to try it until tomorrow, so let me
know.

Thanks,

-Anthony

On Wed, Jul 21, 2010 at 06:58:13AM -0500, Gary Dusbabek wrote:
> Anthony,
> 
> I think you're seeing the results of CASSANDRA-1221.  Each node has
> two connections with its peers.  One connection is used for gossip,
> the other for exchanging commands.  What you see with 1221 is the
> command socket getting 'stuck' after a peer is convicted by gossip and
> then recovers.  It doesn't happen every time, but it happens much of
> the time, especially with streaming.  I was able to reproduce this at
> will using loadbalance; I never tried it under bootstrap (where the
> bootstrapping IP was previously visible on the cluster), but it seems
> very plausible.
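Gary's two-socket description can be modeled with a toy sketch (illustrative Python, not Cassandra's actual Java code; the class and method names here are invented for the example):

```python
# Toy model of the per-peer connection pair described above: gossip and
# commands travel over separate sockets, so gossip can see a peer recover
# while the command connection stays wedged.
class PeerConnections:
    def __init__(self, peer):
        self.peer = peer
        self.gossip_open = True
        self.command_open = True

    def convict(self):
        # Peer declared dead by the failure detector: both logical
        # connections drop.
        self.gossip_open = False
        self.command_open = False

    def on_recover(self):
        # The CASSANDRA-1221 failure mode: gossip reconnects, but nothing
        # re-establishes the command socket, so commands/streams stall.
        self.gossip_open = True
        # self.command_open is left False -- the 'stuck' state

conn = PeerConnections("/10.220.198.15")
conn.convict()
conn.on_recover()
print(conn.gossip_open, conn.command_open)  # gossip back up, commands stuck
```

This is only meant to show why the node looks UP in gossip (the "is now UP" log line) while streaming never makes progress.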
> 
> Any chance you could apply the patch for 1221 and test?
> 
> Gary.
> 
> On Tue, Jul 20, 2010 at 16:45, Anthony Molinaro
> <antho...@alumni.caltech.edu> wrote:
> > I see this in the old nodes
> >
> > DEBUG [WRITE-/10.220.198.15] 2010-07-20 21:15:50,366 OutboundTcpConnection.java (line 142) attempting to connect to /10.220.198.15
> > INFO [GMFD:1] 2010-07-20 21:15:50,391 Gossiper.java (line 586) Node /10.220.198.15 is now part of the cluster
> > INFO [GMFD:1] 2010-07-20 21:15:51,369 Gossiper.java (line 578) InetAddress /10.220.198.15 is now UP
> > INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,369 HintedHandOffManager.java (line 153) Started hinted handoff for endPoint /10.220.198.15
> > INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,371 HintedHandOffManager.java (line 210) Finished hinted handoff of 0 rows to endpoint /10.220.198.15
> > DEBUG [GMFD:1] 2010-07-20 21:17:20,551 StorageService.java (line 512) Node /10.220.198.15 state bootstrapping, token 28356863910078205288614550619314017621
> > DEBUG [GMFD:1] 2010-07-20 21:17:20,656 StorageService.java (line 746) Pending ranges:
> > /10.220.198.15:(21604748163853165203168832909938143241,28356863910078205288614550619314017621]
> > /10.220.198.15:(10637639655367601517656788464652024082,21604748163853165203168832909938143241]
> >
> > 10.220.198.15 is the new node
> >
> > The key ranges seem to be for the primary and replica ranges.
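A rough way to sanity-check those two ranges (illustrative Python, not Cassandra code; assumes RandomPartitioner-style integer tokens, replication factor 2, and only the tokens visible in the log, with the rest of the ring elided):

```python
# With tokens sorted on the ring, a node joining at token T becomes pending
# for its primary range (predecessor(T), T] and, at RF=2, the replica range
# one step further back.
def pending_ranges(sorted_tokens, new_token, rf=2):
    ring = sorted(sorted_tokens + [new_token])
    i = ring.index(new_token)
    n = len(ring)
    # walk backwards rf steps, collecting (left, right] ranges
    return [(ring[(i - k - 1) % n], ring[(i - k) % n]) for k in range(rf)]

# Tokens taken from the debug output above (other ring members elided)
toks = [10637639655367601517656788464652024082,
        21604748163853165203168832909938143241]
new = 28356863910078205288614550619314017621
print(pending_ranges(toks, new))
# prints the same two (left, right] ranges that appear in the
# "Pending ranges" log lines, primary range first
```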
> >
> > So after that, I would expect some AntiCompaction to happen on some of the
> > other nodes, but I don't see anything.
> >
> > Any clues from that output?
> >
> > I did not muck around with the Location tables.
> >
> > -Anthony
> >
> > On Mon, Jul 19, 2010 at 09:36:22PM -0500, Jonathan Ellis wrote:
> >> What gets logged on the old nodes at debug, when you try to add a
> >> single new machine after a full cluster restart?
> >>
> >> Removing Location would blow away the nodes' token information...  It
> >> should be safe if you set the InitialToken to what it used to be on
> >> each machine before bringing it up after nuking those.  Better
> >> snapshot the system keyspace first, just in case.
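For reference, in 0.6 the token is pinned in conf/storage-conf.xml; a sketch of the relevant fragment (the token value below is illustrative, taken from the log output in this thread; use each node's own previous token):

```xml
<!-- conf/storage-conf.xml (Cassandra 0.6): pin the node's token before
     restarting after nuking the Location sstables. -->
<InitialToken>28356863910078205288614550619314017621</InitialToken>
```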
> >>
> >> On Sun, Jul 18, 2010 at 2:01 PM, Anthony Molinaro
> >> <antho...@alumni.caltech.edu> wrote:
> >> > Yeah, I tried all that already and it didn't seem to work; no new nodes
> >> > will bootstrap, which makes me think there's some saved state somewhere
> >> > preventing a new node from bootstrapping.  Maybe the Location
> >> > sstables?  Is it safe to nuke those on all hosts and restart everything?
> >> > (I just don't want to lose actual data.)
> >> >
> >> > Thanks for the ideas,
> >> >
> >> > -Anthony
> >> >
> >> > On Sun, Jul 18, 2010 at 08:09:45PM +0300, shimi wrote:
> >> >> If I have problems with never-ending bootstrapping, I do the following. I try
> >> >> each one; if it doesn't help, I try the next. It might not be the right thing
> >> >> to do, but it worked for me.
> >> >>
> >> >> 1. Restart the bootstrapping node.
> >> >> 2. If I see streaming 0/xxxx, I restart the node and all the streaming nodes.
> >> >> 3. Restart all the nodes.
> >> >> 4. If there is data on the bootstrapping node, I delete it before I restart.
> >> >>
> >> >> Good luck
> >> >> Shimi
> >> >>
> >> >> On Sun, Jul 18, 2010 at 12:21 AM, Anthony Molinaro <
> >> >> antho...@alumni.caltech.edu> wrote:
> >> >>
> >> >> > I'm still waiting for any sort of answer on this one.  The cluster still
> >> >> > refuses to do anything when I bring up new nodes.  I shut down all the
> >> >> > new nodes and am waiting.  I'm guessing that maybe the old nodes have
> >> >> > some state which needs to get cleared out?  Is there anything I can do
> >> >> > at this point?  Are there alternate strategies for bootstrapping I can
> >> >> > try?  (For instance, could I just scp all the sstables to all the new
> >> >> > nodes and do a repair?  Would that actually work?)
> >> >> >
> >> >> > Has anyone seen this sort of issue?  All this is with 0.6.3, so I assume
> >> >> > others will eventually see it too.
> >> >> >
> >> >> > -Anthony
> >> >> >
> >> >> > On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
> >> >> > > Okay, so things were pretty messed up.  I shut down all the new nodes;
> >> >> > > then the old nodes started doing the "half the ring is down" garbage, which
> >> >> > > pretty much requires a full restart of everything.  So I had to shut
> >> >> > > everything down, then bring the seed back, then the rest of the nodes,
> >> >> > > so they finally all agreed on the ring again.
> >> >> > >
> >> >> > > Then I started one of the new nodes and have been watching the logs; so
> >> >> > > far it's been 2 hours since the "Bootstrapping" message appeared in the new
> >> >> > > node's log and nothing has happened.  No anticompaction messages anywhere.
> >> >> > > There's one node compacting, but it's on the other end of the ring, nowhere
> >> >> > > near the new node.  I'm wondering if it will ever get data at this point.
> >> >> > >
> >> >> > > Is there something else I should try?  The only thing I can think of
> >> >> > > is deleting the system directory on the new node and restarting, so
> >> >> > > I'll try that and see if it does anything.
> >> >> > >
> >> >> > > -Anthony
> >> >> > >
> >> >> > > On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
> >> >> > > > On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
> >> >> > > > <antho...@alumni.caltech.edu> wrote:
> >> >> > > > > Is the fact that 2 new nodes are in the range messing it up?
> >> >> > > >
> >> >> > > > Probably.
> >> >> > > >
> >> >> > > > > And if so,
> >> >> > > > > how do I recover?  (I'm thinking: shut down new nodes 2,3,4,5, then bring
> >> >> > > > > up nodes 2 and 4, wait for them to finish, then bring up 3 and 5.)
> >> >> > > >
> >> >> > > > Yes.
> >> >> > > >
> >> >> > > > You might have to restart the old nodes too, to clear out the confusion.
> >> >> > > >
> >> >> > > > --
> >> >> > > > Jonathan Ellis
> >> >> > > > Project Chair, Apache Cassandra
> >> >> > > > co-founder of Riptano, the source for professional Cassandra support
> >> >> > > > http://riptano.com
> >> >> > >
> >> >> >
> >> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of Riptano, the source for professional Cassandra support
> >> http://riptano.com
> >
> >

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <antho...@alumni.caltech.edu>
