On 2015-07-22 10:13, Gregory Farnum wrote:
> On Wed, Jul 22, 2015 at 12:16 PM, Austin S Hemmelgarn <ahferro...@gmail.com> wrote:
>> On 2015-07-21 22:01, Qu Wenruo wrote:
>>> Steve Dainard wrote on 2015/07/21 14:07 -0700:
>>>> I don't know if this has any bearing on the failure case, but the
>>>> filesystem that I sent an image of was only ever created, had a subvol
>>>> created, and was mounted/unmounted several times. There was never any
>>>> data written to that mount point.
>>> Subvol creation plus an rw mount is enough to trigger 2~3 transactions
>>> with DATA written into btrfs, since the first rw mount creates the free
>>> space cache, which is counted as data. But without multiple mount
>>> instances, I really can't think of another way to destroy btrfs this
>>> badly while leaving every csum OK...
>> I know that a while back RBD had some intermittent issues with data
>> corruption in the default configuration when the network isn't absolutely
>> 100% reliable between all nodes (which for Ceph means not only no packet
>> loss, but also tight time synchronization between nodes and only very low
>> network latency). I also heard somewhere (I can't remember exactly where)
>> of people having issues with ZFS on top of RBD. The other thing to keep
>> in mind is that Ceph does automatic background data scrubbing (including
>> rewriting data it thinks is corrupted), so there is no guarantee that the
>> data on the block device won't change suddenly without the FS on it doing
>> anything.
> Ceph will automatically detect inconsistent data with its scrubbing, but
> it won't rewrite that data unless the operator runs a repair command. No
> invisible data changes! :)
Ah, you're right, I forgot about needing admin intervention for changes
(it's been a while since I tried to do anything with Ceph).
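For anyone else following the thread, the admin-side workflow is roughly
the following (from memory, so treat it as a sketch rather than gospel;
the pg ID below is a placeholder):

  # list PGs that scrubbing has flagged as inconsistent
  ceph health detail | grep inconsistent

  # re-verify a suspect PG, then explicitly ask Ceph to repair it
  ceph pg deep-scrub <pgid>
  ceph pg repair <pgid>

The point being that the repair step is always an explicit operator
action; nothing gets rewritten behind the filesystem's back.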
>> Poor time synchronization between the nodes can cause some of the monitor
>> nodes to lose their minds, which can cause corruption if the cluster is
>> actually being utilized, but it won't usually cause issues otherwise
>> (although the cluster will complain very noisily and persistently about
>> the lack of proper time synchronization).
> I'm also not familiar with any consistency issues around network speed or
> time sync, but I could have missed something. The only corruption case I
> can think of was a release that enabled some local FS features which, in
> combination, were buggy on some common kernels in the wild.
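For what it's worth, the clock-skew side of this is easy to spot from the
cluster (rough sketch; the exact wording of the output varies by release):

  # clock skew between monitors shows up as a HEALTH_WARN naming the
  # affected mon(s)
  ceph status
  ceph health detail

  # the tolerated skew is a monitor option in ceph.conf (0.05s by
  # default, IIRC):
  [mon]
  mon clock drift allowed = 0.05

The sane fix is of course to run ntpd against the same set of servers on
every node rather than to loosen that option.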
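And on the btrfs side, Qu's point above about subvolume creation and the
first rw mount generating real transactions is easy to confirm by
watching the superblock generation counter; roughly (assuming a
reasonably current btrfs-progs; older versions spell the second command
btrfs-show-super):

  mkfs.btrfs /dev/sdX
  btrfs inspect-internal dump-super /dev/sdX | grep '^generation'

  mount /dev/sdX /mnt
  btrfs subvolume create /mnt/test
  umount /mnt
  btrfs inspect-internal dump-super /dev/sdX | grep '^generation'

The generation count goes up by a couple even though nothing was ever
written through the mount point, which is exactly the free space cache
(and subvolume metadata) Qu is describing.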