On Thu, Feb 16, 2012 at 12:14 PM, Daniel Shahaf <danie...@elego.de> wrote:

>
> The output from these two tells me two things:
>
> 1. The minfo-cnt value is reasonable (within a typical ballpark).
> That's relevant since minfo-cnt abnormalities were seen in another
> instance of the bug.
>
> 2. Everything else looks correct: the 'id:'/'pred:' headers are accurate,
> and the 'count:' header was incremented correctly.  The 'count:' header
> does, however, indicate that your repository has _in the past_ triggered
> an instance of the bug.

This is true; we have seen the bug before. The first occurrence we know of
was on Dec. 7th, 2011, a few days after we upgraded from 1.6.16 to 1.7.1.
At the time we did not know the cause, and the developer who hit the error
did not report it because he was able to work around it. From the Apache
logs we have:

        [Wed Dec 07 15:16:36 2011] [error] [client 10.2.3.1] predecessor
                count for the root node-revision is wrong: found 59444,
                committing r59478  [409, #160004]
        [Wed Dec 07 15:33:47 2011] [error] [client 10.2.3.2] predecessor
                count for the root node-revision is wrong: found 59482,
                committing r59516  [409, #160004]
        [Wed Dec 07 15:35:19 2011] [error] [client 10.2.3.3] predecessor
                count for the root node-revision is wrong: found 59488,
                committing r59522  [409, #160004]
        [Wed Dec 07 15:44:10 2011] [error] [client 10.2.3.4] predecessor
                count for the root node-revision is wrong: found 59505,
                committing r59539  [409, #160004]
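
Entries like these can be found by grepping the Apache error log for the
message text; the log path below is only an example:

[[[
grep 'predecessor count for the root node-revision is wrong' /var/log/httpd/error_log
]]]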

Of the IPs above, the last entry is the build machine; the others are
developer workstations. I mentioned the two most recent occurrences first
because those were the ones we were actually aware of, and they were recent
enough that we knew to start looking into them. Between Dec. 7 and Jan. 31
the bug occurred 12 times, 3 of them from the build server and the rest from
workstations. This month it has occurred only once, from the build server.

Each time, the error has occurred in a different part of the repository.

>
> In a bit more detail: the value of the 'count:' header should be equal to
> the revision number given as the third argument to dump-noderev.pl.
> (That revision number is also embedded in the 'id:' header, and is
> practically guaranteed to be embedded in the 'text:' header as well.)
> So, there are two things you can do to help us identify the bug:
>
> 1. Hunt for past instances of the bug, identify what revisions triggered
> it, and try and identify a common pattern to those revisions.  (This
> basically calls for running 'dump-noderev.pl $REPOS /' in a loop and
> looking for non-sequential 'count:' or 'pred:' headers in the output for
> a pair of successive revisions.)

I will try to get this done this week.
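
Roughly, I expect the scan to look something like this (a sketch only; the
repository path is a placeholder and I'm assuming dump-noderev.pl is on the
PATH):

[[[
#!/bin/sh
# Sketch only: walk every revision's root node-revision and flag any revision
# whose 'count:' header does not follow its predecessor's by exactly one.
REPOS=/path/to/repos          # placeholder: the real repository path
YOUNGEST=`svnlook youngest "$REPOS"`

prev_count=0                  # the root node-revision of r0 has no predecessors
rev=1
while [ "$rev" -le "$YOUNGEST" ]; do
  count=`dump-noderev.pl "$REPOS" / "$rev" | sed -ne 's/^count: //p'`
  if [ `expr "$count" - "$prev_count"` -ne 1 ]; then
    echo "non-sequential 'count:' at r$rev (count: $count, previous: $prev_count)"
  fi
  prev_count=$count
  rev=`expr "$rev" + 1`
done
]]]

The 'pred:' headers could be checked the same way inside the loop.
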
>
> 2. Look for new instances of the bug.  You could periodically scan for
> new instances of the bug, or implement a post-commit hook such as the
> following (written for unix-like systems, sorry):
>
> [[[
> # look for a corruption or two
>
> # minfo_cnt REV: print the minfo-cnt header of REV's root node-revision
> minfo_cnt() {
>  dump-noderev.pl $REPOS / "$1" | sed -ne 's/minfo-cnt: //p'
> }
> PREV_REV=`expr $REV - 1`
> # an abnormally large jump in minfo-cnt (a difference 7+ characters wide)
> # is the signature seen in another instance of the bug
> if expr `minfo_cnt $PREV_REV` - `minfo_cnt $REV` | grep '.......' >/dev/null; then
>  # echo an error to stderr and mail the admin
>  exit 1
> fi
>
> # skipped_root_noderevs REV: how far REV's root 'count:' header lags behind REV
> skipped_root_noderevs() {
>  expr $1 - `dump-noderev.pl $REPOS / $1 | sed -ne 's/^count: //p'`
> }
> # the lag should stay constant; if it changed, this commit triggered the bug
> if [ "`skipped_root_noderevs $PREV_REV`" -ne "`skipped_root_noderevs $REV`" ]; then
>  # echo an error to stderr and mail the admin
>  exit 2
> fi
> ]]]
>

I will talk to the build team here about the post-commit hook. We have had
the bug occur again since my last reply.
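
If we deploy it, the snippet would be wrapped in the repository's
hooks/post-commit script, which Subversion invokes with the repository path
and the new revision as its first two arguments; a minimal wrapper around
your checks (a sketch only, with the admin address as a placeholder) would
look like:

[[[
#!/bin/sh
# hooks/post-commit: Subversion invokes this as <repos-path> <new-revision>
REPOS="$1"
REV="$2"
ADMIN=svn-admin@example.com   # placeholder address

# report MESSAGE: echo an error to stderr and mail the admin
report() {
  echo "$1" >&2
  echo "$1" | mail -s "possible FSFS corruption in $REPOS at r$REV" "$ADMIN"
}

# ... the two checks from the snippet above go here unchanged,
#     calling 'report' before their 'exit 1' / 'exit 2' ...

exit 0
]]]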

>
> Replied above.  The summary is that you have indeed run into the bug,
> but for some reason not in r61852 but sometime before that (and why
> did r61852 trigger the syslog error anyway?  Good question), and now
> we're at the point of trying to identify the cause of the bug, at
> least circumstantially.
>
> Thanks for your help so far,
>
> Daniel

Hi Daniel.

Replies above. Sorry about the delay in replying; I have been really busy
of late. I will try to get the results this week; if not, it will most
likely be next week.

Thanks

Jason.
