On 25/07/14 03:26, Gustavo Niemeyer wrote:
> Hey Ian,
> 
> On Thu, Jul 24, 2014 at 4:03 AM, Ian Booth <ian.bo...@canonical.com> wrote:
>> However, doing a session.Copy() each time the transaction log collection is
>> queried (every 5 seconds) causes a number of test failures when the host 
>> machine
>> is heavily loaded. Either extra events are received or events are missed.
>>
>> Can anyone explain what's going on here? Why does copying a session affect
>> what's read from the txn log collection?
> 
> Would be good to have more details about the failure scenario, but I
> can provide a few general hints that have a chance of being helpful.
> 
> What session.Copy does when a session is in Strong mode (the default
> one) is to take a new socket out of the pool, or to establish a new
> connection if that's required. So whatever side-effect you're
> observing is a direct consequence of that.
> 
> One way this might be causing trouble, considering that we're talking
> about test cases that involve timing somehow, and that the scenario
> you describe is also of a heavily loaded system, is by skewing the
> clock. It's much more expensive to establish a new connection than to
> send data via a socket at hand.
> 
> Another way it changes the situation is by increasing concurrency. If
> there are two tasks being concurrently done in Go but reaching out to
> the database via the same Strong session, they'll be serialized on
> their way out. With multiple Strong sessions, each has its own
> connection related to its own server thread, and can carry out the
> requests concurrently for longer.
> 
> Please let me know if that helps or whether you have more details.
> 
> 

We've transitioned to using Session.Copy() to address the situation whereby Juju
would create a mongo collection instance and then continue to make db calls
against that collection without realising the underlying socket may have become
disconnected. This resulted in Juju components failing, logging "i/o timeout"
errors talking to mongo, even though mongo itself was still up and running.

Using the pattern:

s := session.Copy()
defer s.Close()
// ... do stuff with s ...

seems to be the accepted way to ensure a fresh underlying socket is selected
from the socket pool maintained by the driver if needed; e.g. this pattern is
used throughout the charm store.
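
For example, a typical call site ends up looking something like this (just a
sketch - the db/collection names here are made up, not actual Juju code):

import "gopkg.in/mgo.v2"

// countMachines is an illustrative helper only. Copying the session means
// this call takes its own socket from the pool; if the socket the original
// session was holding has died, the copy picks up (or establishes) a
// healthy one instead.
func countMachines(session *mgo.Session) (int, error) {
        s := session.Copy()
        defer s.Close()
        return s.DB("juju").C("machines").Count()
}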

As an aside - I'm wondering whether the mgo driver shouldn't transparently catch
an i/o error associated with a dead socket and retry using a fresh connection
rather than imposing that responsibility on the caller?
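
Today the equivalent has to live in the caller, something like the following
(purely hypothetical - mgo has no typed "dead socket" error, so the check is
crude; Refresh() is the existing call that discards the reserved socket so
the next operation grabs a fresh one):

import (
        "io"
        "strings"

        "gopkg.in/mgo.v2"
)

// withRetry is a hypothetical helper, not part of mgo or Juju: run op once
// and, if it fails in a way that smells like a dead socket, refresh the
// session and try exactly one more time.
func withRetry(s *mgo.Session, op func() error) error {
        err := op()
        if err == nil {
                return nil
        }
        if err == io.EOF || strings.Contains(err.Error(), "i/o timeout") {
                s.Refresh() // drop the dead socket; the retry gets a fresh one
                err = op()
        }
        return err
}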

The watcher itself is a goroutine managed using a tomb. It has a loop which
runs every 5 seconds to poll the transaction log collection. The watcher
initially grabs the _id (lastId) of the most recent transaction record, then
every 5 seconds runs:

iter := log.Find(nil).Batch(10).Sort("-$natural").Iter()

where log is session.DB(...).C("txns.log").

The code iterates over the log entries until the last known _id is reached. Any
newly added entries are turned into events. The application code which adds new
entries to the log does so using a different session.

The symptoms of the test failures are that either extra events are sent or not
enough events are sent. The lastId is calculated once when the watcher is
started. From that point, each poll finds the entries to process by iterating
over the collection back to the last processed _id, updating lastId along the
way so that the next poll stops at the right place.
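
In outline the loop looks something like this - a simplified sketch rather than
the actual code, with the db name, emitEvents and the exact _id handling all
approximated:

import (
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/tomb.v1"
)

type Watcher struct {
        tomb    tomb.Tomb
        session *mgo.Session
        // ... registered watches, channels, etc
}

func (w *Watcher) loop() error {
        log := w.session.DB("juju").C("txns.log")

        // Remember the newest entry at startup; only entries added after
        // this point become events.
        var newest bson.M
        err := log.Find(nil).Sort("-$natural").One(&newest)
        if err != nil && err != mgo.ErrNotFound {
                return err
        }
        var lastId interface{}
        if newest != nil {
                lastId = newest["_id"]
        }

        for {
                select {
                case <-w.tomb.Dying():
                        return tomb.ErrDying
                case <-time.After(5 * time.Second):
                }

                iter := log.Find(nil).Batch(10).Sort("-$natural").Iter()
                var first interface{}
                for {
                        var entry bson.M
                        if !iter.Next(&entry) {
                                break
                        }
                        if entry["_id"] == lastId {
                                break // caught up with the previous poll
                        }
                        if first == nil {
                                first = entry["_id"] // newest entry seen this poll
                        }
                        w.emitEvents(entry) // hypothetical: turn the log entry into Change events
                }
                if err := iter.Close(); err != nil {
                        return err
                }
                if first != nil {
                        lastId = first // next poll iterates back to here
                }
        }
}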

The tests simply create and start a watcher, make a db change, and wait for the
watcher to emit events. If the expected events don't arrive within a given time,
the tests fail. So there are two goroutines: one for the watcher polling loop
and one in which the test runs.

So it seems the socket copy causes the view of the underlying collection to
become "incorrect" somehow, e.g. newly added records are not seen, so they are
not processed and the expected events never get sent out. I've tried adding an
extra delay to the test to allow the watcher to wake up one more time and poll
the log collection, in case any new records would become visible then, but that
didn't help.

If session.Copy() doesn't work here, what approach should we use to ensure the
watcher doesn't simply go dead because the underlying socket dies? Or how can we
make the session.Copy() approach work reliably even when the host machine is
under high load? Or maybe the watcher code is fine and the tests are wrong?

The tests are quite simple:

func (s *FastPeriodSuite) TestWatchAfterKnown(c *gc.C) {
        // insert a new record to cause a new log entry to be created
        revno := s.insert(c, "test", "a")

        // start the watcher
        s.w.StartSync()

        // listen for changes
        s.w.Watch("test", "a", -1, s.ch)

        // this next line fails........
        assertChange(c, s.ch, watcher.Change{"test", "a", revno})
        assertNoChange(c, s.ch)

        assertOrder(c, -1, revno)
}

func assertChange(c *gc.C, watch <-chan watcher.Change, want watcher.Change) {
        select {
        case got := <-watch:
                if got != want {
                        c.Fatalf("watch reported %v, want %v", got, want)
                }
        case <-time.After(worstCase):
                c.Fatalf("watch reported nothing, want %v", want)
        }
}
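
assertNoChange is just the mirror image - roughly this (the timeout constant
name here is a guess):

func assertNoChange(c *gc.C, watch <-chan watcher.Change) {
        select {
        case got := <-watch:
                c.Fatalf("watch reported %v, want nothing", got)
        case <-time.After(shortWait): // shortWait: long enough for at least one poll
        }
}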





