Hi Knut Anders,

I got a chance to read through your performance analysis, and it was
very interesting: thorough and detailed. Thanks for putting it together!

This project seems to be making great progress!

However, respectfully, it seems that:

> The patch fixes the failure in unit/T_RawStoreFactory.unit.

may be a bit of an overstatement given that:

> The failure in T_RawStoreFactory is still a mystery to me.

I don't have much guidance to offer here, but I encourage you to
take the time to pursue this topic. I think there is something
deep here that is worth understanding, and since you have
come across a solid, reproducible case, now is a good opportunity.

> since I see the same failure if I change the scan direction in
> the old buffer manager, I'm confident that the buffer manager is not the
> problem.

In my limited experience with DBMS buffer managers in general,
I've found that they can have some extremely subtle bugs, and
it's hard to tell whether the buffer manager is or is not part
of the problem. Its interaction with the rest of the store is so
subtle and complex that it can behave in very surprising ways.

If I'm understanding your description so far, you discovered that:
 - the test allocates some pages
 - then rolls back to a savepoint
 - then asserts that the page allocations should have been undone

Some possible causes of such a problem would be:
 - callers not obeying the latching protocol(s) properly, thus
   making changes to pages without correctly informing the buffer manager
 - the buffer manager itself not correctly managing the latch/dirty flags,
   so not noticing that a page had been altered
 - the buffer manager not correctly interacting with the recovery
   subsystem, so that the buffer manager wasn't aware of the effects
   of the rollback-to-savepoint
 - the logging of the page allocations not being properly associated
   with the savepoint, so that the assumption that rolling back to
   the savepoint will undo those allocations does not hold.
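
To make that last possibility concrete, here is a toy sketch of how an
allocation that is not recorded against the savepoint would survive a
rollback. Everything in it is hypothetical: SavepointSketch and its
methods are names I made up for illustration, not Derby classes, and the
real store is far more involved than a set and a stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Toy model only: none of these names are Derby classes or APIs. */
public class SavepointSketch {
    // Pages currently considered allocated.
    private final Set<Integer> allocated = new HashSet<>();
    // Undo log of logged allocations, newest entry at the head.
    private final Deque<Integer> undoLog = new ArrayDeque<>();
    private int nextPage = 0;

    /** A savepoint is modelled as the current depth of the undo log. */
    public int setSavepoint() {
        return undoLog.size();
    }

    /**
     * Allocates a page. Passing logged = false simulates an allocation
     * that never reaches the undo log (the suspected bug).
     */
    public int allocatePage(boolean logged) {
        int page = nextPage++;
        allocated.add(page);
        if (logged) {
            undoLog.push(page);
        }
        return page;
    }

    /** Undoes every logged allocation made after the given savepoint. */
    public void rollbackTo(int savepoint) {
        while (undoLog.size() > savepoint) {
            allocated.remove(undoLog.pop());
        }
    }

    public boolean isAllocated(int page) {
        return allocated.contains(page);
    }

    public static void main(String[] args) {
        SavepointSketch store = new SavepointSketch();
        int before = store.allocatePage(true);    // allocated before the savepoint
        int sp = store.setSavepoint();
        int logged = store.allocatePage(true);    // properly logged
        int unlogged = store.allocatePage(false); // bypasses the log
        store.rollbackTo(sp);

        // The test's assertion corresponds to the last two lines being false.
        System.out.println("before savepoint: " + store.isAllocated(before));
        System.out.println("logged:           " + store.isAllocated(logged));
        System.out.println("unlogged:         " + store.isAllocated(unlogged));
    }
}
```

In this model, the page allocated outside the log is the only one that
outlives the rollback, which is the kind of symptom the test's assertion
would catch.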

I'm sure there are other possibilities worth investigating; hopefully
Mike or Dan or someone else has some more to suggest.

I think the observation that the failure is closely tied to the scan
direction is an important clue, as is the observation that both the
new and the old buffer managers suffer from the problem.

Perhaps a next step would be to develop a more detailed theory about
what precisely could be causing this, then add some additional
assertions into the related code to try to gather more information.

I hope this is helpful.

thanks,

bryan
