On 02/22/2016 06:34 PM, Kevin Wolf wrote:
Am 22.02.2016 um 10:02 hat Dr. David Alan Gilbert geschrieben:
* Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:
On 02/20/2016 10:28 PM, Max Reitz wrote:
On 19.02.2016 12:24, Alberto Garcia wrote:
On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <we...@cn.fujitsu.com> wrote:
If quorum has two children(A, B). A do flush sucessfully, but B
flush failed. We MUST choice A as winner rather than just pick
anyone of them. Otherwise the filesystem of guest will become
read-only with following errors:
end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only
Hi Xie,
Let's see if I'm getting this right:
- When Quorum flushes to disk, there's a vote among the return values of
the flush operations of its members, and the one that wins is the one
that Quorum returns.
- If there's a tie then Quorum choses the first result from the list of
winners.
- With your patch you want to give priority to the vote with result == 0
if there's any, so Quorum would return 0 (and succeed).
This seems to me like an ad-hoc fix for a particular use case. What
if you have 3 members and two of them fail with the same error code?
Would you still return 0 or the error code from the other two?
For example:
children.0 returns 0
children.1 returns -EIO
children.2 returns -EPIPE
In this case, quorum returns -EPIPE now(without this patch).
For example:
children.0 returns -EPIPE
children.1 returns -EIO
children.2 returns 0
In this case, quorum returns 0 now.
My question is: what's the rationale for returning 0 in case a) but not
in case b)?
a)
children.0 returns -EPIPE
children.1 returns -EIO
children.2 returns 0
b)
children.0 returns -EIO
children.1 returns -EIO
children.2 returns 0
In both cases you have one successful flush and two errors. You want to
return always 0 in case a) and always -EIO in case b). But the only
difference is that in case b) the errors happen to be the same, so why
does that matter?
That said, I'm not very convinced of the current logics of the Quorum
flush code either, so it's not even a problem with your patch... it
seems to me that the code should follow the same logics as in the
read/write case: if the number of correct flushes >= threshold then
return 0, else select the most common error code.
I'm not convinced of the logic either, which is why I waited for you to
respond to this patch. :-)
Intuitively, I'd expect Quorum to return an error if flushing failed for
any of the children, because, well, flushing failed. I somehow feel like
flushing is different from a read or write operation and therefore
ignoring the threshold would be fine here. However, maybe my intuition
is just off.
Anyway, regardless of that, if we do take the threshold into account, we
should not use the exact error value for voting but just whether an
error occurred or not. If all but one children fail to flush (all for
different reasons), I find it totally wrong to return success. We should
then just return -EIO or something.
Hi Berto & Max
Thanks for your comments, i'd like to have a summary here. For flush cases:
1) if flush successfully(result >= 0), result = 0; else if result < 0,
result = -EIO. then invoke quorum_count_vote
2) if correct flushes >= threshold, mark correct flushes as winner directly.
Please try to return one of the real error codes instead of -EIO.
Essentially we should use the same logic as for writes (like Berto
suggested above).
Yes, i correct myself. It seems everyone reaches an agreement in this
thread, and will use the same logic as writes in next version.
Thanks
-Xie
I find it difficult to understand how this corresponds to the behaviour needed
in COLO, where we have the NBD and the real storage on the primary; in that
case the failure of the real storage should give an error to the guest, but the
failure of the NBD shouldn't produce a guest visible failure.
That's probably because you're abusing quorum as an active mirroring
filter, which it really isn't. This is okay for now so you have
something to work with, but I expect that eventually you'll need a
different driver. (Well, or maybe I'm mistaken and you actually do need
to read back data from NBD to compare it to the real storage - do you?)
Anyway, I'm curious how you would handle a failed write/flush request to
the NBD target. Simply ignoring it doesn't feel right; in case of a
failover, wouldn't you switch to a potentially corrupted image then?
Kevin
.