Hi Changwei, On 2018/1/27 19:19, Changwei Ge wrote: > Hi Jun, > > On 2018/1/27 16:28, piaojun wrote: >> We should not reuse the dirty bh in jbd2 directly due to the following >> situation: >> >> 1. When removing extent rec, we will dirty the bhs of extent rec and >> truncate log at the same time, and hand them over to jbd2. >> 2. The bhs are submitted to jbd2 area successfully. >> 3. The write-back thread of device help flush the bhs to disk but >> encounter write error due to abnormal storage link. >> 4. After a while the storage link become normal. Truncate log flush >> worker triggered by the next space reclaiming found the dirty bh of >> truncate log and clear its 'BH_Write_EIO' and then set it uptodate in >> __ocfs2_journal_access(): >> >> ocfs2_truncate_log_worker >> ocfs2_flush_truncate_log >> __ocfs2_flush_truncate_log >> ocfs2_replay_truncate_records >> ocfs2_journal_access_di >> __ocfs2_journal_access // here we clear io_error and set 'tl_bh' >> uptodata. >> >> 5. Then jbd2 will flush the bh of truncate log to disk, but the bh of >> extent rec is still in error state, and unfortunately nobody will >> take care of it. >> 6. At last the space of extent rec was not reduced, but truncate log >> flush worker have given it back to globalalloc. That will cause >> duplicate cluster problem which could be identified by fsck.ocfs2. >> >> Sadlly we can hardly revert this but set fs read-only in case of >> ruining atomicity and consistency of space reclaim. >> >> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in >> __ocfs2_journal_access") >> >> Signed-off-by: Jun Piao <piao...@huawei.com> >> Reviewed-by: Yiwen Jiang <jiangyi...@huawei.com> >> --- >> fs/ocfs2/journal.c | 49 ++++++++++++++++++++++++++++++++++++++----------- >> 1 file changed, 38 insertions(+), 11 deletions(-) >> >> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c >> index 3630443..4c5661c 100644 >> --- a/fs/ocfs2/journal.c >> +++ b/fs/ocfs2/journal.c >> @@ -666,23 +666,50 @@ static int __ocfs2_journal_access(handle_t *handle, >> /* we can safely remove this assertion after testing. */ >> if (!buffer_uptodate(bh)) { >> mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n"); >> - mlog(ML_ERROR, "b_blocknr=%llu\n", >> - (unsigned long long)bh->b_blocknr); >> + mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n", >> + (unsigned long long)bh->b_blocknr, bh->b_state); >> >> lock_buffer(bh); >> /* >> - * A previous attempt to write this buffer head failed. >> - * Nothing we can do but to retry the write and hope for >> - * the best. >> + * We should not reuse the dirty bh directly due to the >> + * following situation: >> + * >> + * 1. When removing extent rec, we will dirty the bhs of >> + * extent rec and truncate log at the same time, and >> + * hand them over to jbd2. >> + * 2. The bhs are submitted to jbd2 area successfully. >> + * 3. The write-back thread of device help flush the bhs >> + * to disk but encounter write error due to abnormal >> + * storage link. >> + * 4. After a while the storage link become normal. >> + * Truncate log flush worker triggered by the next >> + * space reclaiming found the dirty bh of truncate log >> + * and clear its 'BH_Write_EIO' and then set it uptodate >> + * in __ocfs2_journal_access(): >> + * >> + * ocfs2_truncate_log_worker >> + * ocfs2_flush_truncate_log >> + * __ocfs2_flush_truncate_log >> + * ocfs2_replay_truncate_records >> + * ocfs2_journal_access_di >> + * __ocfs2_journal_access >> + * >> + * 5. Then jbd2 will flush the bh of truncate log to disk, >> + * but the bh of extent rec is still in error state, and >> + * unfortunately nobody will take care of it. >> + * 6. At last the space of extent rec was not reduced, >> + * but truncate log flush worker have given it back to >> + * globalalloc. That will cause duplicate cluster problem >> + * which could be identified by fsck.ocfs2. >> + * >> + * Sadlly we can hardly revert this but set fs read-only >> + * in case of ruining atomicity and consistency of space >> + * reclaim. >> */ > > It's tons of comments here and it seems the same with what's in change log. > Must we add them here? > Besides, the scenario the comment is describing is not a common case. > I think the issue will also cause some other metadata inconsistency. > So the comments will make maintainers puzzle afterwards. > I suggest to simplify it like below, for your reference: > > A previous transaction with a couple buffer heads fails checkpoint, so all > those buffer heads are marked IO_ERROR. > For current transaction, a buffer head is shared with the previous one with > IO_ERROR. > We can't just clear IO_ERROR and continue ,since other buffer heads are not > written to disk yet. > Above case will cause metadata inconsistency. > Just return -EORFS here and abort journal to avoid damage file system. > > Thanks, > Changwei > Your suggestion makes sense, and I will simplify it later in patch v3.
thanks, Jun >> if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) { >> - clear_buffer_write_io_error(bh); >> - set_buffer_uptodate(bh); >> - } >> - >> - if (!buffer_uptodate(bh)) { >> unlock_buffer(bh); >> - return -EIO; >> + return ocfs2_error(osb->sb, "A previous attempt to " >> + "write this buffer head failed\n"); >> } >> unlock_buffer(bh); >> } >> > . > _______________________________________________ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel