Bryan Pendleton <[EMAIL PROTECTED]> writes:

> Hi Knut Anders,
>
> I got a chance to read through your performance analysis and it was
> very interesting. Very thorough and detailed. Thanks for putting it together!
>
> This project seems to be making great progress!
>
> However, respectfully, it seems that:
>
>> The patch fixes the failure in unit/T_RawStoreFactory.unit.
>
> may be a bit of an overstatement given that:
>
>> The failure in T_RawStoreFactory is still a mystery to me.
>
> I don't have much guidance to offer here, but I encourage you to
> take the time to pursue this topic, as I think there is something
> deep here and worth understanding, and given that you had
> come across a solid and reproducible case, the opportunity is here.

Good point! I have logged DERBY-3099 and will try to collect more info
and attach to the issue.

>> since I see the same failure if I change the scan direction in
>> the old buffer manager, I'm confident that the buffer manager is not the
>> problem.

Hmm, I might have been too quick here. When I retried it on a clean
trunk, it didn't have any effect, so it was probably something else I
saw. But there are a number of other ways to reproduce it with the old
buffer manager, so that shouldn't cause any problems (other than making
it even more of a mystery...).

> In my limited experience with DBMS buffer managers in general,
> I've found that they can have some extremely subtle bugs, and
> it's hard to figure out how to tell when the buffer manager is
> or is not part of the problem. Its interaction with the rest of
> the store is so subtle and complex, it can have very surprising behaviors.
>
> If I'm understanding your description so far, you discovered that:
> - the test allocates some pages
> - then rolls back to a savepoint
> - then asserts that the page allocations should have been undone
>
> Some possible causes of such a problem would be:
> - callers not obeying the latching protocol(s) properly, thus
>   making changes to pages without correctly informing the buffer manager
> - the buffer manager itself not correctly managing the latch/dirty flags,
>   so not noticing that a page had been altered
> - the buffer manager not correctly interacting with the recovery
>   subsystem, so that the buffer manager wasn't aware of the effects
>   of the rollback-to-savepoint
> - the logging of the page allocations not being properly associated
>   with the savepoint, so that the assumption that rolling back to
>   the savepoint will undo those allocations does not hold.

My first thought was that some part of the data/state of the cached
object might have survived a call to Cacheable.clearIdentity() so that
reusing a cached object is not identical to creating a fresh one.
That could explain why changing the scan order or disabling the scan
changed the behaviour, but I have no evidence that supports this.

> I'm sure there are other possibilities worth investigating; hopefully
> Mike or Dan or someone else has some more to suggest.
>
> I think that the observation that changing the scan direction is
> closely related to the problem is an important clue, as is the observation
> that both the new and the old buffer managers suffer from the problem.
>
> Perhaps a next step would be to develop a more detailed theory about
> what precisely could be causing this, then add some additional
> assertions into the related code to try to gather more information.
>
> I hope this is helpful.

Indeed it is! Thanks for the feedback. I'll dive into the code and see
if I find something interesting that can give us a clue about what's
happening.

--
Knut Anders