Hi David,
On 12/01/2016 12:16 AM, David Teigland wrote:
On Wed, Nov 30, 2016 at 05:07:22PM +0800, Eric Ren wrote:
a. Should we run recover_lvb() even before recover_conversion()? If not, why?
Yes, I think you're right. The lvb decision should be made using the
original lock modes, not the modified lock modes from recover_conversion.
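Got it. To check my understanding of why the ordering matters, here is a small user-space toy model (not the fs/dlm code; the struct and helper names below are made up for illustration). It only shows that the "is the lvb still trustworthy?" decision can flip once in-progress conversions have already been resolved, which is why it should look at the original grant modes:

#include <stdio.h>
#include <stdbool.h>

enum mode { NL, CR, CW, PR, PW, EX };

struct lock {
        enum mode grmode;       /* granted mode */
        enum mode rqmode;       /* mode the lock was converting to */
        bool converting;
};

/* Toy version of the decision recover_lvb() has to make: keep the lvb
 * only if some lock strong enough to have written it (> CR) survived. */
static bool lvb_still_valid(const struct lock *locks, int n)
{
        for (int i = 0; i < n; i++)
                if (locks[i].grmode > CR)
                        return true;
        return false;
}

/* Toy stand-in for recover_conversion(): resolve an in-progress
 * conversion by moving the lock to its requested mode. */
static void fixup_conversions(struct lock *locks, int n)
{
        for (int i = 0; i < n; i++)
                if (locks[i].converting)
                        locks[i].grmode = locks[i].rqmode;
}

int main(void)
{
        /* Made-up scenario: one surviving lock held PR but was
         * converting down to NL when recovery started. */
        struct lock locks[] = { { .grmode = PR, .rqmode = NL, .converting = true } };

        printf("decision on original modes: lvb valid = %d\n",
               lvb_still_valid(locks, 1));

        fixup_conversions(locks, 1);
        printf("decision on adjusted modes: lvb valid = %d\n",
               lvb_still_valid(locks, 1));
        return 0;
}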
b. Why should we clear the flag RSB_VALNOTVALID in the else branch?
That looks incorrect also. I think VALNOTVALID should only be cleared
when the lvb is written by the application. The else condition should
probably just be removed. That does raise the question of whether it
could be masking another problem (e.g. some case where the flag is not
being cleared when it should be, or a case where it's being set and
shouldn't be).
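Understood. So the rule I take away is: recovery may set the flag when it can no longer trust the lvb, and only an application handing back a freshly written lvb clears it. A toy sketch of that rule (again made-up names, not the kernel code), with no else branch on the recovery side:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct resource {
        bool val_not_valid;     /* stands in for RSB_VALNOTVALID */
        char lvb[32];
};

/* Recovery path: if the lvb contents can no longer be trusted, mark the
 * resource.  Deliberately no else branch clearing the flag here; that is
 * the questionable code being removed. */
static void recovery_mark_lvb(struct resource *r, bool lvb_lost)
{
        if (lvb_lost)
                r->val_not_valid = true;
}

/* Application path: only a lock holder handing back a freshly written lvb
 * makes the value trustworthy again. */
static void app_write_lvb(struct resource *r, const char *new_lvb, size_t len)
{
        memcpy(r->lvb, new_lvb, len < sizeof(r->lvb) ? len : sizeof(r->lvb));
        r->val_not_valid = false;
}

int main(void)
{
        struct resource r = { .val_not_valid = false };

        recovery_mark_lvb(&r, true);            /* recovery lost the lvb */
        printf("after recovery:  not valid = %d\n", r.val_not_valid);

        app_write_lvb(&r, "new contents", sizeof("new contents"));
        printf("after app write: not valid = %d\n", r.val_not_valid);
        return 0;
}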
Thanks a lot for your help!
The ocfs2 panic (mlog_bug_on_msg(le64_to_cpu(fe->i_size) != i_size_read(inode), ...)) can still be
triggered by our test, in which the *same* file is truncated randomly from 3 nodes while one of
the nodes is reset.
Here are some observations that I think are worth mentioning:
1) With the o2cb cluster stack (ocfs2's internal DLM), no panic so far. I think this is because
o2cb simply discards the LVB during failure and recovery, which makes ocfs2 read the metadata
block from disk.
2) With fs/dlm, the ocfs2 code decides whether to go to disk by checking the VALNOTVALID flag
(see the sketch after this list).
3) Without the commit (dlm: fix lvb invalidation conditions), we can reproduce the panic very
often.
4) With the commit (dlm: fix lvb invalidation conditions), plus these changes:
   a) move recover_lvb() before recover_conversion();
   b) remove the else condition;
the panic only happens when we reset the master node of the rsb for the ocfs2 inode.
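To illustrate observations 1) and 2), here is a toy model (made-up names, not the ocfs2 or fs/dlm code) of how trusting a stale lvb after recovery could leave the in-memory i_size out of sync with the on-disk inode, while discarding the lvb forces a read from disk:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

struct meta_lvb {
        bool valid;             /* stands in for "VALNOTVALID is not set" */
        uint64_t i_size;        /* size the last writer cached in the lvb */
};

struct toy_inode {
        uint64_t i_size;        /* in-memory size */
        uint64_t disk_i_size;   /* size in the on-disk inode block */
};

/* fs/dlm-style stack in this model: trust the lvb only while it is valid. */
static void refresh_from_lock(struct toy_inode *ip, const struct meta_lvb *lvb)
{
        if (lvb->valid)
                ip->i_size = lvb->i_size;       /* skip the disk read */
        else
                ip->i_size = ip->disk_i_size;   /* go to disk */
}

/* o2cb-style recovery in this model: drop the lvb unconditionally, which
 * forces every node back to the on-disk block. */
static void o2cb_style_recovery(struct meta_lvb *lvb)
{
        lvb->valid = false;
}

int main(void)
{
        struct toy_inode ip = { .i_size = 0, .disk_i_size = 4096 };
        struct meta_lvb lvb = { .valid = true, .i_size = 8192 };        /* stale */

        /* If recovery leaves a stale lvb marked valid, the node never reads
         * the disk block and ends up with i_size != disk i_size. */
        refresh_from_lock(&ip, &lvb);
        printf("stale lvb trusted: i_size=%" PRIu64 " disk=%" PRIu64 "\n",
               ip.i_size, ip.disk_i_size);

        /* Discarding the lvb (as o2cb does in this model) avoids the mismatch. */
        o2cb_style_recovery(&lvb);
        refresh_from_lock(&ip, &lvb);
        printf("lvb discarded:     i_size=%" PRIu64 " disk=%" PRIu64 "\n",
               ip.i_size, ip.disk_i_size);
        return 0;
}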
I will look into it more to see if I can find anything solid ;-)
Thanks,
Eric
Dave