Bryan Pendleton <[EMAIL PROTECTED]> writes:

> Hi Knut Anders,
>
> I got a chance to read through your performance analysis and it was
> very interesting. Very thorough and detailed. Thanks for putting it together!
>
> This project seems to be making great progress!
>
> However, respectfully, it seems that:
>
>> The patch fixes the failure in unit/T_RawStoreFactory.unit.
>
> may be a bit of an overstatement given that:
>
>> The failure in T_RawStoreFactory is still a mystery to me.
>
> I don't have much guidance to offer here, but I encourage you to
> take the time to pursue this topic, as I think there is something
> deep here and worth understanding, and given that you had
> come across a solid and reproducible case, the opportunity is here.

Good point! I have logged DERBY-3099 and will try to collect more info
and attach it to the issue.

>> since I see the same failure if I change the scan direction in
>> the old buffer manager, I'm confident that the buffer manager is not the 
>> problem.

Hmm, I might have been too quick here. When I retried it on a clean
trunk, changing the scan direction didn't have any effect, so it was
probably something else I saw. But there are a number of other ways to
reproduce the failure with the old buffer manager, so the conclusion
still stands (it just makes the failure even more of a mystery...).

> In my limited experience with DBMS buffer managers in general,
> I've found that they can have some extremely subtle bugs, and
> it's hard to figure out how to tell when the buffer manager is
> or is not part of the problem. Its interaction with the rest of
> the store is so subtle and complex, it can have very surprising behaviors.
>
> If I'm understanding your description so far, you discovered that:
>  - the test allocates some pages
>  - then rolls back to a savepoint
>  - then asserts that the page allocations should have been undone
>
> Some possible causes of such a problem would be:
>  - callers not obeying the latching protocol(s) properly, thus
>    making changes to pages without correctly informing the buffer manager
>  - the buffer manager itself not correctly managing the latch/dirty flags,
>    so not noticing that a page had been altered
>  - the buffer manager not correctly interacting with the recovery
>    subsystem, so that the buffer manager wasn't aware of the effects
>    of the rollback-to-savepoint
>  - the logging of the page allocations not being properly associated
>    with the savepoint, so that the assumption that rolling back to
>    the savepoint will undo those allocations does not hold.

My first thought was that some part of the data/state of the cached
object might have survived a call to Cacheable.clearIdentity() so that
reusing a cached object is not identical to creating a fresh one. That
could explain why changing the scan order or disabling the scan changed
the behaviour, but I have no evidence that supports this.
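
To make the hypothesis concrete, here is a minimal Java sketch of how
state could leak through an incomplete reset. It is purely illustrative:
CachedPageSketch, StaleStateDemo and all of their methods are
hypothetical names made up for this example, not Derby's Cacheable
interface or any real Derby class, and nothing here is evidence that
Derby actually behaves this way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a cached page object (NOT Derby's Cacheable).
class CachedPageSketch {
    private Long identity;                 // page currently held, or null
    private boolean dirty;
    private final List<Integer> allocatedSlots = new ArrayList<>();

    void setIdentity(long pageNumber) { identity = pageNumber; }

    void allocateSlot(int slot) {
        allocatedSlots.add(slot);
        dirty = true;
    }

    // Deliberately incomplete reset: allocatedSlots is left untouched.
    void clearIdentityIncomplete() {
        identity = null;
        dirty = false;
    }

    // Complete reset: the object is indistinguishable from a fresh one.
    void clearIdentityComplete() {
        identity = null;
        dirty = false;
        allocatedSlots.clear();
    }

    int slotCount() { return allocatedSlots.size(); }
}

class StaleStateDemo {
    // Recycles one object for a second page and returns the number of
    // slots the second user sees before allocating anything itself.
    static int slotsSeenAfterReuse(boolean completeReset) {
        CachedPageSketch page = new CachedPageSketch();
        page.setIdentity(1L);
        page.allocateSlot(0);
        page.allocateSlot(1);
        if (completeReset) {
            page.clearIdentityComplete();
        } else {
            page.clearIdentityIncomplete();
        }
        page.setIdentity(2L);              // object reused for another page
        return page.slotCount();
    }

    public static void main(String[] args) {
        System.out.println("fresh object:     " + new CachedPageSketch().slotCount());
        System.out.println("complete reset:   " + slotsSeenAfterReuse(true));
        System.out.println("incomplete reset: " + slotsSeenAfterReuse(false));
    }
}
```

With the incomplete reset, the recycled object reports slots that a
freshly created object would not have. Whether an object gets recycled
at all depends on which entries the cache chooses to evict, which is
why a change in scan order could make such a bug appear or disappear.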

> I'm sure there are other possibilities worth investigating; hopefully
> Mike or Dan or someone else has some more to suggest.
>
> I think that the observation that changing the scan direction is
> closely related to the problem is an important clue, as is the observation
> that both the new and the old buffer managers suffer from the problem.
>
> Perhaps a next step would be to develop a more detailed theory about
> what precisely could be causing this, then add some additional
> assertions into the related code to try to gather more information.
>
> I hope this is helpful.

Indeed it is! Thanks for the feedback. I'll dive into the code and see
if I find something interesting that can give us a clue about what's
happening.

-- 
Knut Anders
