Hi Knut Anders,

I got a chance to read through your performance analysis and it was very interesting. Very thorough and detailed. Thanks for putting it together!

This project seems to be making great progress! However, respectfully, it seems that:

> The patch fixes the failure in unit/T_RawStoreFactory.unit.

may be a bit of an overstatement given that:

> The failure in T_RawStoreFactory is still a mystery to me.

I don't have much guidance to offer here, but I encourage you to take the time to pursue this topic, as I think there is something deep here and worth understanding, and given that you had come across a solid and reproducible case, the opportunity is here.

> since I see the same failure if I change the scan direction in
> the old buffer manager, I'm confident that the buffer manager is not the problem.

In my limited experience with DBMS buffer managers in general, I've found that they can have some extremely subtle bugs, and it's hard to figure out how to tell when the buffer manager is or is not part of the problem. Its interaction with the rest of the store is so subtle and complex, it can have very surprising behaviors.

If I'm understanding your description so far, you discovered that:
 - the test allocates some pages
 - then rolls back to a savepoint
 - then asserts that the page allocations should have been undone

Some possible causes of such a problem would be:
 - callers not obeying the latching protocol(s) properly, thus making changes to pages without correctly informing the buffer manager
 - the buffer manager itself not correctly managing the latch/dirty flags, so not noticing that a page had been altered
 - the buffer manager not correctly interacting with the recovery subsystem, so that the buffer manager wasn't aware of the effects of the rollback-to-savepoint
 - the logging of the page allocations not being properly associated with the savepoint, so that the assumption that rolling back to the savepoint will undo those allocations does not hold.

I'm sure there are other possibilities worth investigating; hopefully Mike or Dan or someone else has some more to suggest.
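
To check that I am describing the scenario correctly, here is a toy sketch in Java of the invariant I understand the test to be asserting. None of these names (SavepointSketch, ToyPageStore, allocatePage, setSavepoint, rollbackToSavepoint) come from Derby; I made them up for illustration, and the real store is of course far more involved. The sketch only shows the expected behavior when every allocation is logged against the savepoint.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

public class SavepointSketch {

    /** Hypothetical toy page store; an assumption for illustration, not Derby's API. */
    static final class ToyPageStore {
        private final Set<Integer> allocated = new TreeSet<>();
        // Undo log: the most recently allocated page is at the head.
        private final Deque<Integer> undoLog = new ArrayDeque<>();
        private int nextPage = 1;

        int allocatePage() {
            int page = nextPage++;
            allocated.add(page);
            undoLog.push(page); // log the allocation so it can be undone later
            return page;
        }

        /** In this toy model a savepoint is just the current undo-log length. */
        int setSavepoint() {
            return undoLog.size();
        }

        /** Undo, newest first, every allocation logged after the savepoint. */
        void rollbackToSavepoint(int savepoint) {
            while (undoLog.size() > savepoint) {
                allocated.remove(undoLog.pop());
            }
        }

        boolean isAllocated(int page) {
            return allocated.contains(page);
        }

        int allocatedCount() {
            return allocated.size();
        }
    }

    private static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        ToyPageStore store = new ToyPageStore();
        int before = store.allocatePage();   // allocated before the savepoint
        int savepoint = store.setSavepoint();
        int p1 = store.allocatePage();       // allocated after the savepoint
        int p2 = store.allocatePage();

        store.rollbackToSavepoint(savepoint);

        check(store.isAllocated(before), "page from before the savepoint must survive");
        check(!store.isAllocated(p1) && !store.isAllocated(p2),
              "pages allocated after the savepoint must be undone");
        System.out.println("allocations after the savepoint were undone");
    }
}
```

If an allocation somehow skipped the undoLog.push step in this toy model, the second check would fail, which is roughly the shape of the last possible cause I listed above.
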
I think that the observation that changing the scan direction is closely related to the problem is an important clue, as is the observation that both the new and the old buffer managers suffer from the problem.

Perhaps a next step would be to develop a more detailed theory about what precisely could be causing this, then add some additional assertions into the related code to try to gather more information.

I hope this is helpful.

thanks,

bryan